-
Scalable Approximate Biclique Counting over Large Bipartite Graphs
Authors:
Jingbang Chen,
Weinuo Li,
Yingli Zhou,
Hangrui Zhou,
Qiuyang Mang,
Can Wang,
Yixiang Fang,
Chenhao Ma
Abstract:
Counting $(p,q)$-bicliques in bipartite graphs is crucial for a variety of applications, from recommendation systems to cohesive subgraph analysis. Yet, it remains computationally challenging due to the combinatorial explosion to exactly count the $(p,q)$-bicliques. In many scenarios, e.g., graph kernel methods, however, exact counts are not strictly required. To design a scalable and high-quality…
▽ More
Counting $(p,q)$-bicliques in bipartite graphs is crucial for a variety of applications, from recommendation systems to cohesive subgraph analysis. Yet, it remains computationally challenging due to the combinatorial explosion to exactly count the $(p,q)$-bicliques. In many scenarios, e.g., graph kernel methods, however, exact counts are not strictly required. To design a scalable and high-quality approximate solution, we novelly resort to $(p,q)$-broom, a special spanning tree of the $(p,q)$-biclique, which can be counted via graph coloring and efficient dynamic programming. Based on the intermediate results of the dynamic programming, we propose an efficient sampling algorithm to derive the approximate $(p,q)$-biclique count from the $(p,q)$-broom counts. Theoretically, our method offers unbiased estimates with provable error guarantees. Empirically, our solution outperforms existing approximation techniques in both accuracy (up to 8$\times$ error reduction) and runtime (up to 50$\times$ speedup) on nine real-world bipartite networks, providing a scalable solution for large-scale $(p,q)$-biclique counting.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Asynchronous Decentralized SGD under Non-Convexity: A Block-Coordinate Descent Framework
Authors:
Yijie Zhou,
Shi Pu
Abstract:
Decentralized optimization has become vital for leveraging distributed data without central control, enhancing scalability and privacy. However, practical deployments face fundamental challenges due to heterogeneous computation speeds and unpredictable communication delays. This paper introduces a refined model of Asynchronous Decentralized Stochastic Gradient Descent (ADSGD) under practical assum…
▽ More
Decentralized optimization has become vital for leveraging distributed data without central control, enhancing scalability and privacy. However, practical deployments face fundamental challenges due to heterogeneous computation speeds and unpredictable communication delays. This paper introduces a refined model of Asynchronous Decentralized Stochastic Gradient Descent (ADSGD) under practical assumptions of bounded computation and communication times. To understand the convergence of ADSGD, we first analyze Asynchronous Stochastic Block Coordinate Descent (ASBCD) as a tool, and then show that ADSGD converges under computation-delay-independent step sizes. The convergence result is established without assuming bounded data heterogeneity. Empirical experiments reveal that ADSGD outperforms existing methods in wall-clock convergence time across various scenarios. With its simplicity, efficiency in memory and communication, and resilience to communication and computation delays, ADSGD is well-suited for real-world decentralized learning tasks.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
High Quality Underwater Image Compression with Adaptive Correction and Codebook-based Augmentation
Authors:
Yimin Zhou,
Yichong Xia,
Sicheng Pan,
Bin Chen,
Baoyi An,
Haoqian Wang,
Zhi Wang,
Yaowei Wang,
Zikun Zhou
Abstract:
With the increasing exploration and exploitation of the underwater world, underwater images have become a critical medium for human interaction with marine environments, driving extensive research into their efficient transmission and storage. However, contemporary underwater image compression algorithms fail to fully leverage the unique characteristics distinguishing underwater scenes from terres…
▽ More
With the increasing exploration and exploitation of the underwater world, underwater images have become a critical medium for human interaction with marine environments, driving extensive research into their efficient transmission and storage. However, contemporary underwater image compression algorithms fail to fully leverage the unique characteristics distinguishing underwater scenes from terrestrial images, resulting in suboptimal performance. To address this limitation, we introduce HQUIC, designed to exploit underwater-image-specific features for enhanced compression efficiency. HQUIC employs an ALTC module to adaptively predict the attenuation coefficients and global light information of the images, which effectively mitigates the issues caused by the differences in lighting and tone existing in underwater images. Subsequently, HQUIC employs a codebook as an auxiliary branch to extract the common objects within underwater images and enhances the performance of the main branch. Furthermore, HQUIC dynamically weights multi-scale frequency components, prioritizing information critical for distortion quality while discarding redundant details. Extensive evaluations on diverse underwater datasets demonstrate that HQUIC outperforms state-of-the-art compression methods.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Automated SAR ADC Sizing Using Analytical Equations
Authors:
Zhongyi Li,
Zhuofu Tao,
Yanze Zhou,
Yichen Shi,
Zhiping Yu,
Ting-Jung Lin,
Lei He
Abstract:
Conventional analog and mixed-signal (AMS) circuit designs heavily rely on manual effort, which is time-consuming and labor-intensive. This paper presents a fully automated design methodology for Successive Approximation Register (SAR) Analog-to-Digital Converters (ADCs) from performance specifications to complete transistor sizing. To tackle the high-dimensional sizing problem, we propose a dual…
▽ More
Conventional analog and mixed-signal (AMS) circuit designs heavily rely on manual effort, which is time-consuming and labor-intensive. This paper presents a fully automated design methodology for Successive Approximation Register (SAR) Analog-to-Digital Converters (ADCs) from performance specifications to complete transistor sizing. To tackle the high-dimensional sizing problem, we propose a dual optimization scheme. The system-level optimization iteratively partitions the overall requirements and analytically maps them to subcircuit design specifications, while local optimization loops determines the subcircuits' design parameters. The dependency graph-based framework serializes the simulations for verification, knowledge-based calculations, and transistor sizing optimization in topological order, which eliminates the need for human intervention. We demonstrate the effectiveness of the proposed methodology through two case studies with varying performance specifications, achieving high SNDR and low power consumption while meeting all the specified design constraints.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
A Comparative Review of RNA Language Models
Authors:
He Wang,
Yikun Zhang,
Jie Chen,
Jian Zhan,
Yaoqi Zhou
Abstract:
Given usefulness of protein language models (LMs) in structure and functional inference, RNA LMs have received increased attentions in the last few years. However, these RNA models are often not compared against the same standard. Here, we divided RNA LMs into three classes (pretrained on multiple RNA types (especially noncoding RNAs), specific-purpose RNAs, and LMs that unify RNA with DNA or prot…
▽ More
Given usefulness of protein language models (LMs) in structure and functional inference, RNA LMs have received increased attentions in the last few years. However, these RNA models are often not compared against the same standard. Here, we divided RNA LMs into three classes (pretrained on multiple RNA types (especially noncoding RNAs), specific-purpose RNAs, and LMs that unify RNA with DNA or proteins or both) and compared 13 RNA LMs along with 3 DNA and 1 protein LMs as controls in zero-shot prediction of RNA secondary structure and functional classification. Results shows that the models doing well on secondary structure prediction often perform worse in function classification or vice versa, suggesting that more balanced unsupervised training is needed.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Deployable and Generalizable Motion Prediction: Taxonomy, Open Challenges and Future Directions
Authors:
Letian Wang,
Marc-Antoine Lavoie,
Sandro Papais,
Barza Nisar,
Yuxiao Chen,
Wenhao Ding,
Boris Ivanovic,
Hao Shao,
Abulikemu Abuduweili,
Evan Cook,
Yang Zhou,
Peter Karkus,
Jiachen Li,
Changliu Liu,
Marco Pavone,
Steven Waslander
Abstract:
Motion prediction, the anticipation of future agent states or scene evolution, is rooted in human cognition, bridging perception and decision-making. It enables intelligent systems, such as robots and self-driving cars, to act safely in dynamic, human-involved environments, and informs broader time-series reasoning challenges. With advances in methods, representations, and datasets, the field has…
▽ More
Motion prediction, the anticipation of future agent states or scene evolution, is rooted in human cognition, bridging perception and decision-making. It enables intelligent systems, such as robots and self-driving cars, to act safely in dynamic, human-involved environments, and informs broader time-series reasoning challenges. With advances in methods, representations, and datasets, the field has seen rapid progress, reflected in quickly evolving benchmark results. Yet, when state-of-the-art methods are deployed in the real world, they often struggle to generalize to open-world conditions and fall short of deployment standards. This reveals a gap between research benchmarks, which are often idealized or ill-posed, and real-world complexity.
To address this gap, this survey revisits the generalization and deployability of motion prediction models, with an emphasis on the applications of robotics, autonomous driving, and human motion. We first offer a comprehensive taxonomy of motion prediction methods, covering representations, modeling strategies, application domains, and evaluation protocols. We then study two key challenges: (1) how to push motion prediction models to be deployable to realistic deployment standards, where motion prediction does not act in a vacuum, but functions as one module of closed-loop autonomy stacks - it takes input from the localization and perception, and informs downstream planning and control. 2) how to generalize motion prediction models from limited seen scenarios/datasets to the open-world settings. Throughout the paper, we highlight critical open challenges to guide future work, aiming to recalibrate the community's efforts, fostering progress that is not only measurable but also meaningful for real-world applications.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony
Authors:
Shaoyu Wang,
Guangrong He,
Geon-Woo Kim,
Yanqi Zhou,
Seo Jin Park
Abstract:
Mixture-of-Experts (MoE) architectures offer the promise of larger model capacity without the prohibitive costs of fully dense designs. However, in real-world inference serving, load skew across experts often leads to suboptimal device utilization and excessive synchronization overheads. This paper introduces Asynchronous Expert Parallelism (AEP), a new paradigm that decouples layer execution from…
▽ More
Mixture-of-Experts (MoE) architectures offer the promise of larger model capacity without the prohibitive costs of fully dense designs. However, in real-world inference serving, load skew across experts often leads to suboptimal device utilization and excessive synchronization overheads. This paper introduces Asynchronous Expert Parallelism (AEP), a new paradigm that decouples layer execution from barrier-style synchronization. By dynamically queuing tokens at each layer (referred to as $μ$-queuing) and adaptively re-batching them on demand, GPUs avoid waiting for straggling experts and instead continuously process whichever layer is ready. This asynchronous approach mitigates two major inefficiencies in traditional expert-parallel systems: (1) idle GPU time while waiting for the hottest expert, and (2) small-batch executions on colder experts that waste memory bandwidth.
We implement these ideas in a serving system called AMoE, which disaggregates attention from expert layers and uses a defragging scheduler to reduce batch fragmentation. Evaluations on prototype MoE models show that AMoE improves throughput by up to 2.7x compared to state-of-the-art baselines, incurring a manageable latency penalty and providing a cost-effective operating point. Furthermore, experiments demonstrate nearly linear scalability to multi-node settings, whereas the baseline system shows no throughput increase even when the number of GPUs is doubled.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Generative AI for Autonomous Driving: Frontiers and Opportunities
Authors:
Yuping Wang,
Shuo Xing,
Cui Can,
Renjie Li,
Hongyuan Hua,
Kexin Tian,
Zhaobin Mo,
Xiangbo Gao,
Keshu Wu,
Sulong Zhou,
Hengxu You,
Juntong Peng,
Junge Zhang,
Zehao Wang,
Rui Song,
Mingxuan Yan,
Walter Zimmer,
Xingcheng Zhou,
Peiran Li,
Zhaohan Lu,
Chia-Ju Chen,
Yue Huang,
Ryan A. Rossi,
Lichao Sun,
Hongkai Yu
, et al. (22 additional authors not shown)
Abstract:
Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, partic…
▽ More
Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, particularly the pursuit of Level 5 autonomy. This survey delivers a comprehensive and critical synthesis of the emerging role of GenAI across the autonomous driving stack. We begin by distilling the principles and trade-offs of modern generative modeling, encompassing VAEs, GANs, Diffusion Models, and Large Language Models (LLMs). We then map their frontier applications in image, LiDAR, trajectory, occupancy, video generation as well as LLM-guided reasoning and decision making. We categorize practical applications, such as synthetic data workflows, end-to-end driving strategies, high-fidelity digital twin systems, smart transportation networks, and cross-domain transfer to embodied AI. We identify key obstacles and possibilities such as comprehensive generalization across rare cases, evaluation and safety checks, budget-limited implementation, regulatory compliance, ethical concerns, and environmental effects, while proposing research plans across theoretical assurances, trust metrics, transport integration, and socio-technical influence. By unifying these threads, the survey provides a forward-looking reference for researchers, engineers, and policymakers navigating the convergence of generative AI and advanced autonomous mobility. An actively maintained repository of cited works is available at https://github.com/taco-group/GenAI4AD.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
A Survey of 3D Reconstruction with Event Cameras: From Event-based Geometry to Neural 3D Rendering
Authors:
Chuanzhi Xu,
Haoxian Zhou,
Langyi Chen,
Haodong Chen,
Ying Zhou,
Vera Chung,
Qiang Qu
Abstract:
Event cameras have emerged as promising sensors for 3D reconstruction due to their ability to capture per-pixel brightness changes asynchronously. Unlike conventional frame-based cameras, they produce sparse and temporally rich data streams, which enable more accurate 3D reconstruction and open up the possibility of performing reconstruction in extreme environments such as high-speed motion, low l…
▽ More
Event cameras have emerged as promising sensors for 3D reconstruction due to their ability to capture per-pixel brightness changes asynchronously. Unlike conventional frame-based cameras, they produce sparse and temporally rich data streams, which enable more accurate 3D reconstruction and open up the possibility of performing reconstruction in extreme environments such as high-speed motion, low light, or high dynamic range scenes. In this survey, we provide the first comprehensive review focused exclusively on 3D reconstruction using event cameras. The survey categorises existing works into three major types based on input modality - stereo, monocular, and multimodal systems, and further classifies them by reconstruction approach, including geometry-based, deep learning-based, and recent neural rendering techniques such as Neural Radiance Fields and 3D Gaussian Splatting. Methods with a similar research focus were organised chronologically into the most subdivided groups. We also summarise public datasets relevant to event-based 3D reconstruction. Finally, we highlight current research limitations in data availability, evaluation, representation, and dynamic scene handling, and outline promising future research directions. This survey aims to serve as a comprehensive reference and a roadmap for future developments in event-driven 3D reconstruction.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care
Authors:
Zhi Da Soh,
Yang Bai,
Kai Yu,
Yang Zhou,
Xiaofeng Lei,
Sahil Thakur,
Zann Lee,
Lee Ching Linette Phang,
Qingsheng Peng,
Can Can Xue,
Rachel Shujuan Chong,
Quan V. Hoang,
Lavanya Raghavan,
Yih Chung Tham,
Charumathi Sabanayagam,
Wei-Chi Wu,
Ming-Chih Ho,
Jiangnan He,
Preeti Gupta,
Ecosse Lamoureux,
Seang Mei Saw,
Vinay Nangia,
Songhomitra Panda-Jonas,
Jie Xu,
Ya Xing Wang
, et al. (6 additional authors not shown)
Abstract:
Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptati…
▽ More
Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptation, we fine-tuned our VFMs to detect ocular and systemic diseases, differentiate ocular disease severity, and identify common ocular signs. The model achieved 100% accuracy in routing fundus images to appropriate VFMs, which achieved $\ge$ 82.2% accuracy in disease detection, $\ge$ 89% in severity differentiation, $\ge$ 76% in sign identification. Meta-EyeFM was 11% to 43% more accurate than Gemini-1.5-flash and ChatGPT-4o LMMs in detecting various eye diseases and comparable to an ophthalmologist. This system offers enhanced usability and diagnostic performance, making it a valuable decision support tool for primary eye care or an online LLM for fundus evaluation.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
Authors:
Bowen Zhang,
Congchao Guo,
Geng Yang,
Hang Yu,
Haozhe Zhang,
Heidi Lei,
Jialong Mai,
Junjie Yan,
Kaiyue Yang,
Mingqi Yang,
Peikai Huang,
Ruiyang Jin,
Sitan Jiang,
Weihua Cheng,
Yawei Li,
Yichen Xiao,
Yiying Zhou,
Yongmao Zhang,
Yuan Lu,
Yucen He
Abstract:
We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, w…
▽ More
We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluations metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text to voice (T2V) by synthesizing timbre features directly from text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
A Tale of Two Identities: An Ethical Audit of Human and AI-Crafted Personas
Authors:
Pranav Narayanan Venkit,
Jiayi Li,
Yingfan Zhou,
Sarah Rajtmajer,
Shomir Wilson
Abstract:
As LLMs (large language models) are increasingly used to generate synthetic personas particularly in data-limited domains such as health, privacy, and HCI, it becomes necessary to understand how these narratives represent identity, especially that of minority communities. In this paper, we audit synthetic personas generated by 3 LLMs (GPT4o, Gemini 1.5 Pro, Deepseek 2.5) through the lens of repres…
▽ More
As LLMs (large language models) are increasingly used to generate synthetic personas particularly in data-limited domains such as health, privacy, and HCI, it becomes necessary to understand how these narratives represent identity, especially that of minority communities. In this paper, we audit synthetic personas generated by 3 LLMs (GPT4o, Gemini 1.5 Pro, Deepseek 2.5) through the lens of representational harm, focusing specifically on racial identity. Using a mixed methods approach combining close reading, lexical analysis, and a parameterized creativity framework, we compare 1512 LLM generated personas to human-authored responses. Our findings reveal that LLMs disproportionately foreground racial markers, overproduce culturally coded language, and construct personas that are syntactically elaborate yet narratively reductive. These patterns result in a range of sociotechnical harms, including stereotyping, exoticism, erasure, and benevolent bias, that are often obfuscated by superficially positive narrations. We formalize this phenomenon as algorithmic othering, where minoritized identities are rendered hypervisible but less authentic. Based on these findings, we offer design recommendations for narrative-aware evaluation metrics and community-centered validation protocols for synthetic identity generation.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
SweRank: Software Issue Localization with Code Ranking
Authors:
Revanth Gangi Reddy,
Tarun Suresh,
JaeHyeok Doo,
Ye Liu,
Xuan Phi Nguyen,
Yingbo Zhou,
Semih Yavuz,
Caiming Xiong,
Heng Ji,
Shafiq Joty
Abstract:
Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches demonstrate promise, they often incur significant latency and cost due to complex multi-step rea…
▽ More
Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches demonstrate promise, they often incur significant latency and cost due to complex multi-step reasoning and relying on closed-source LLMs. Alternatively, traditional code ranking models, typically optimized for query-to-code or code-to-code retrieval, struggle with the verbose and failure-descriptive nature of issue localization queries. To bridge this gap, we introduce SweRank, an efficient and effective retrieve-and-rerank framework for software issue localization. To facilitate training, we construct SweLoc, a large-scale dataset curated from public GitHub repositories, featuring real-world issue descriptions paired with corresponding code modifications. Empirical results on SWE-Bench-Lite and LocBench show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems using closed-source LLMs like Claude-3.5. Further, we demonstrate SweLoc's utility in enhancing various existing retriever and reranker models for issue localization, establishing the dataset as a valuable resource for the community.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Diffusion-driven SpatioTemporal Graph KANsformer for Medical Examination Recommendation
Authors:
Jianan Li,
Yangtao Zhou,
Zhifu Zhao,
Qinglan Huang,
Jian Qi,
Xiao He,
Hua Chu,
Fu Li
Abstract:
Recommendation systems in AI-based medical diagnostics and treatment constitute a critical component of AI in healthcare. Although some studies have explored this area and made notable progress, healthcare recommendation systems remain in their nascent stage. And these researches mainly target the treatment process such as drug or disease recommendations. In addition to the treatment process, the…
▽ More
Recommendation systems in AI-based medical diagnostics and treatment constitute a critical component of AI in healthcare. Although some studies have explored this area and made notable progress, healthcare recommendation systems remain in their nascent stage. And these researches mainly target the treatment process such as drug or disease recommendations. In addition to the treatment process, the diagnostic process, particularly determining which medical examinations are necessary to evaluate the condition, also urgently requires intelligent decision support. To bridge this gap, we first formalize the task of medical examination recommendations. Compared to traditional recommendations, the medical examination recommendation involves more complex interactions. This complexity arises from two folds: 1) The historical medical records for examination recommendations are heterogeneous and redundant, which makes the recommendation results susceptible to noise. 2) The correlation between the medical history of patients is often irregular, making it challenging to model spatiotemporal dependencies. Motivated by the above observation, we propose a novel Diffusion-driven SpatioTemporal Graph KANsformer for Medical Examination Recommendation (DST-GKAN) with a two-stage learning paradigm to solve the above challenges. In the first stage, we exploit a task-adaptive diffusion model to distill recommendation-oriented information by reducing the noises in heterogeneous medical data. In the second stage, a spatiotemporal graph KANsformer is proposed to simultaneously model the complex spatial and temporal relationships. Moreover, to facilitate the medical examination recommendation research, we introduce a comprehensive dataset. The experimental results demonstrate the state-of-the-art performance of the proposed method compared to various competitive baselines.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
Authors:
Xiaokun Wang,
Chris,
Jiangbo Pei,
Wei Shen,
Yi Peng,
Yunzhuo Hao,
Weijie Qiu,
Ai Jian,
Tianyidan Xie,
Xuchen Song,
Yang Liu,
Yahui Zhou
Abstract:
We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our technical approach comprises two key components: First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM rea…
▽ More
We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our technical approach comprises two key components: First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM reasoners. Second, we design a reward model architecture based on Qwen2.5-VL-7B-Instruct, integrating a reward head and applying multi-stage fine-tuning using pairwise ranking loss on pairwise preference data. Experimental evaluations show that Skywork-VL Reward achieves state-of-the-art results on multimodal VL-RewardBench and exhibits competitive performance on the text-only RewardBench benchmark. Furthermore, preference data constructed based on our Skywork-VL Reward proves highly effective for training Mixed Preference Optimization (MPO), leading to significant improvements in multimodal reasoning capabilities. Our results underscore Skywork-VL Reward as a significant advancement toward general-purpose, reliable reward models for multimodal alignment. Our model has been publicly released to promote transparency and reproducibility.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Seed1.5-VL Technical Report
Authors:
Dong Guo,
Faming Wu,
Feida Zhu,
Fuxing Leng,
Guang Shi,
Haobin Chen,
Haoqi Fan,
Jian Wang,
Jianyu Jiang,
Jiawei Wang,
Jingji Chen,
Jingjia Huang,
Kang Lei,
Liping Yuan,
Lishu Luo,
Pengfei Liu,
Qinghao Ye,
Rui Qian,
Shen Yan,
Shixiong Zhao,
Shuai Peng,
Shuangye Li,
Sihang Yuan,
Sijin Wu,
Tianheng Cheng
, et al. (172 additional authors not shown)
Abstract:
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…
▽ More
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Assessing the Robustness and Reducibility of Multiplex Networks with Embedding-Aided Interlayer Similarities
Authors:
Haoran Nan,
Senquan Wang,
Chun Ouyang,
Yanchen Zhou,
Weiwei Gu
Abstract:
The study of interlayer similarity of multiplex networks helps to understand the intrinsic structure of complex systems, revealing how changes in one layer can propagate and affect others, thus enabling broad implications for transportation, social, and biological systems. Existing algorithms that measure similarity between network layers typically encode only partial information, which limits the…
▽ More
The study of interlayer similarity of multiplex networks helps to understand the intrinsic structure of complex systems, revealing how changes in one layer can propagate and affect others, thus enabling broad implications for transportation, social, and biological systems. Existing algorithms that measure similarity between network layers typically encode only partial information, which limits their effectiveness in capturing the full complexity inherent in multiplex networks. To address this limitation, we propose a novel interlayer similarity measuring approach named Embedding Aided inTerlayer Similarity (EATSim). EATSim concurrently incorporates intralayer structural similarity and cross-layer anchor node alignment consistency, providing a more comprehensive framework for analyzing interconnected systems. Extensive experiments on both synthetic and real-world networks demonstrate that EATSim effectively captures the underlying geometric similarities between interconnected networks, significantly improving the accuracy of interlayer similarity measurement. Moreover, EATSim achieves state-of-the-art performance in two downstream applications: predicting network robustness and network reducibility, showing its great potential in enhancing the understanding and management of complex systems.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Bi-level Mean Field: Dynamic Grouping for Large-Scale MARL
Authors:
Yuxuan Zheng,
Yihe Zhou,
Feiyang Xu,
Mingli Song,
Shunyu Liu
Abstract:
Large-scale Multi-Agent Reinforcement Learning (MARL) often suffers from the curse of dimensionality, as the exponential growth in agent interactions significantly increases computational complexity and impedes learning efficiency. To mitigate this, existing efforts that rely on Mean Field (MF) simplify the interaction landscape by approximating neighboring agents as a single mean agent, thus redu…
▽ More
Large-scale Multi-Agent Reinforcement Learning (MARL) often suffers from the curse of dimensionality, as the exponential growth in agent interactions significantly increases computational complexity and impedes learning efficiency. To mitigate this, existing efforts that rely on Mean Field (MF) simplify the interaction landscape by approximating neighboring agents as a single mean agent, thus reducing overall complexity to pairwise interactions. However, these MF methods inevitably fail to account for individual differences, leading to aggregation noise caused by inaccurate iterative updates during MF learning. In this paper, we propose a Bi-level Mean Field (BMF) method to capture agent diversity with dynamic grouping in large-scale MARL, which can alleviate aggregation noise via bi-level interaction. Specifically, BMF introduces a dynamic group assignment module, which employs a Variational AutoEncoder (VAE) to learn the representations of agents, facilitating their dynamic grouping over time. Furthermore, we propose a bi-level interaction module to model both inter- and intra-group interactions for effective neighboring aggregation. Experiments across various tasks demonstrate that the proposed BMF yields results superior to the state-of-the-art methods. Our code will be made publicly available.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
xGen-small Technical Report
Authors:
Erik Nijkamp,
Bo Pang,
Egor Pakhomov,
Akash Gokul,
Jin Qu,
Silvio Savarese,
Yingbo Zhou,
Caiming Xiong
Abstract:
We introduce xGen-small, a family of 4B and 9B Transformer decoder models optimized for long-context applications. Our vertically integrated pipeline unites domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension to 128k tokens; and targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning.…
▽ More
We introduce xGen-small, a family of 4B and 9B Transformer decoder models optimized for long-context applications. Our vertically integrated pipeline unites domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension to 128k tokens; and targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning. xGen-small delivers strong performance across various tasks, especially in math and coding domains, while excelling at long context benchmarks.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
LLMs Get Lost In Multi-Turn Conversation
Authors:
Philippe Laban,
Hiroaki Hayashi,
Yingbo Zhou,
Jennifer Neville
Abstract:
Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions,…
▽ More
Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that *when LLMs take a wrong turn in a conversation, they get lost and do not recover*.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Towards Better Cephalometric Landmark Detection with Diffusion Data Generation
Authors:
Dongqian Guo,
Wencheng Han,
Pang Lyu,
Yuxi Zhou,
Jianbing Shen
Abstract:
Cephalometric landmark detection is essential for orthodontic diagnostics and treatment planning. Nevertheless, the scarcity of samples in data collection and the extensive effort required for manual annotation have significantly impeded the availability of diverse datasets. This limitation has restricted the effectiveness of deep learning-based detection methods, particularly those based on large…
▽ More
Cephalometric landmark detection is essential for orthodontic diagnostics and treatment planning. Nevertheless, the scarcity of samples in data collection and the extensive effort required for manual annotation have significantly impeded the availability of diverse datasets. This limitation has restricted the effectiveness of deep learning-based detection methods, particularly those based on large-scale vision models. To address these challenges, we have developed an innovative data generation method capable of producing diverse cephalometric X-ray images along with corresponding annotations without human intervention. To achieve this, our approach initiates by constructing new cephalometric landmark annotations using anatomical priors. Then, we employ a diffusion-based generator to create realistic X-ray images that correspond closely with these annotations. To achieve precise control in producing samples with different attributes, we introduce a novel prompt cephalometric X-ray image dataset. This dataset includes real cephalometric X-ray images and detailed medical text prompts describing the images. By leveraging these detailed prompts, our method improves the generation process to control different styles and attributes. Facilitated by the large, diverse generated data, we introduce large-scale vision detection models into the cephalometric landmark detection task to improve accuracy. Experimental results demonstrate that training with the generated data substantially enhances the performance. Compared to methods without using the generated data, our approach improves the Success Detection Rate (SDR) by 6.5%, attaining a notable 82.2%. All code and data are available at: https://um-lab.github.io/cepha-generation
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
FloE: On-the-Fly MoE Inference on Memory-constrained GPU
Authors:
Yuxin Zhou,
Zheng Li,
Jun Zhang,
Jue Wang,
Yiping Wang,
Zhongle Xie,
Ke Chen,
Lidan Shou
Abstract:
With the widespread adoption of Mixture-of-Experts (MoE) models, there is a growing demand for efficient inference on memory-constrained devices. While offloading expert parameters to CPU memory and loading activated experts on demand has emerged as a potential solution, the large size of activated experts overburdens the limited PCIe bandwidth, hindering the effectiveness in latency-sensitive sce…
▽ More
With the widespread adoption of Mixture-of-Experts (MoE) models, there is a growing demand for efficient inference on memory-constrained devices. While offloading expert parameters to CPU memory and loading activated experts on demand has emerged as a potential solution, the large size of activated experts overburdens the limited PCIe bandwidth, hindering the effectiveness in latency-sensitive scenarios. To mitigate this, we propose FloE, an on-the-fly MoE inference system on memory-constrained GPUs. FloE is built on the insight that there exists substantial untapped redundancy within sparsely activated experts. It employs various compression techniques on the expert's internal parameter matrices to reduce the data movement load, combined with low-cost sparse prediction, achieving perceptible inference acceleration in wall-clock time on resource-constrained devices. Empirically, FloE achieves a 9.3x compression of parameters per expert in Mixtral-8x7B; enables deployment on a GPU with only 11GB VRAM, reducing the memory footprint by up to 8.5x; and delivers a 48.7x inference speedup compared to DeepSpeed-MII on a single GeForce RTX 3090 - all with only a 4.4$\%$ - 7.6$\%$ average performance degradation.
△ Less
Submitted 11 May, 2025; v1 submitted 9 May, 2025;
originally announced May 2025.
-
IRNN: Innovation-driven Recurrent Neural Network for Time-Series Data Modeling and Prediction
Authors:
Yifan Zhou,
Yibo Wang,
Chao Shang
Abstract:
Many real-world datasets are time series that are sequentially collected and contain rich temporal information. Thus, a common interest in practice is to capture dynamics of time series and predict their future evolutions. To this end, the recurrent neural network (RNN) has been a prevalent and effective machine learning option, which admits a nonlinear state-space model representation. Motivated…
▽ More
Many real-world datasets are time series that are sequentially collected and contain rich temporal information. Thus, a common interest in practice is to capture dynamics of time series and predict their future evolutions. To this end, the recurrent neural network (RNN) has been a prevalent and effective machine learning option, which admits a nonlinear state-space model representation. Motivated by the resemblance between RNN and Kalman filter (KF) for linear state-space models, we propose in this paper Innovation-driven RNN (IRNN), a novel RNN architecture tailored to time-series data modeling and prediction tasks. By adapting the concept of "innovation" from KF to RNN, past prediction errors are adopted as additional input signals to update hidden states of RNN and boost prediction performance. Since innovation data depend on network parameters, existing training algorithms for RNN do not apply to IRNN straightforwardly. Thus, a tailored training algorithm dubbed input updating-based back-propagation through time (IU-BPTT) is further proposed, which alternates between updating innovations and optimizing network parameters via gradient descent. Experiments on real-world benchmark datasets show that the integration of innovations into various forms of RNN leads to remarkably improved prediction accuracy of IRNN without increasing the training cost substantially.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Towards Facial Image Compression with Consistency Preserving Diffusion Prior
Authors:
Yimin Zhou,
Yichong Xia,
Bin Chen,
Baoyi An,
Haoqian Wang,
Zhi Wang,
Yaowei Wang,
Zikun Zhou
Abstract:
With the widespread application of facial image data across various domains, the efficient storage and transmission of facial images has garnered significant attention. However, the existing learned face image compression methods often produce unsatisfactory reconstructed image quality at low bit rates. Simply adapting diffusion-based compression methods to facial compression tasks results in reco…
▽ More
With the widespread application of facial image data across various domains, the efficient storage and transmission of facial images has garnered significant attention. However, the existing learned face image compression methods often produce unsatisfactory reconstructed image quality at low bit rates. Simply adapting diffusion-based compression methods to facial compression tasks results in reconstructed images that perform poorly in downstream applications due to insufficient preservation of high-frequency information. To further explore the diffusion prior in facial image compression, we propose Facial Image Compression with a Stable Diffusion Prior (FaSDiff), a method that preserves consistency through frequency enhancement. FaSDiff employs a high-frequency-sensitive compressor in an end-to-end framework to capture fine image details and produce robust visual prompts. Additionally, we introduce a hybrid low-frequency enhancement module that disentangles low-frequency facial semantics and stably modulates the diffusion prior alongside visual prompts. The proposed modules allow FaSDiff to leverage diffusion priors for superior human visual perception while minimizing performance loss in machine vision due to semantic inconsistency. Extensive experiments show that FaSDiff outperforms state-of-the-art methods in balancing human visual quality and machine vision accuracy. The code will be released after the paper is accepted.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Predicting Diabetic Macular Edema Treatment Responses Using OCT: Dataset and Methods of APTOS Competition
Authors:
Weiyi Zhang,
Peranut Chotcomwongse,
Yinwen Li,
Pusheng Xu,
Ruijie Yao,
Lianhao Zhou,
Yuxuan Zhou,
Hui Feng,
Qiping Zhou,
Xinyue Wang,
Shoujin Huang,
Zihao Jin,
Florence H. T. Chung,
Shujun Wang,
Yalin Zheng,
Mingguang He,
Danli Shi,
Paisan Ruamviboonsuk
Abstract:
Diabetic macular edema (DME) significantly contributes to visual impairment in diabetic patients. Treatment responses to intravitreal therapies vary, highlighting the need for patient stratification to predict therapeutic benefits and enable personalized strategies. To our knowledge, this study is the first to explore pre-treatment stratification for predicting DME treatment responses. To advance…
▽ More
Diabetic macular edema (DME) significantly contributes to visual impairment in diabetic patients. Treatment responses to intravitreal therapies vary, highlighting the need for patient stratification to predict therapeutic benefits and enable personalized strategies. To our knowledge, this study is the first to explore pre-treatment stratification for predicting DME treatment responses. To advance this research, we organized the 2nd Asia-Pacific Tele-Ophthalmology Society (APTOS) Big Data Competition in 2021. The competition focused on improving predictive accuracy for anti-VEGF therapy responses using ophthalmic OCT images. We provided a dataset containing tens of thousands of OCT images from 2,000 patients with labels across four sub-tasks. This paper details the competition's structure, dataset, leading methods, and evaluation metrics. The competition attracted strong scientific community participation, with 170 teams initially registering and 41 reaching the final round. The top-performing team achieved an AUC of 80.06%, highlighting the potential of AI in personalized DME treatment and clinical decision-making.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
BMMDetect: A Multimodal Deep Learning Framework for Comprehensive Biomedical Misconduct Detection
Authors:
Yize Zhou,
Jie Zhang,
Meijie Wang,
Lun Yu
Abstract:
Academic misconduct detection in biomedical research remains challenging due to algorithmic narrowness in existing methods and fragmented analytical pipelines. We present BMMDetect, a multimodal deep learning framework that integrates journal metadata (SJR, institutional data), semantic embeddings (PubMedBERT), and GPT-4o-mined textual attributes (methodological statistics, data anomalies) for hol…
▽ More
Academic misconduct detection in biomedical research remains challenging due to algorithmic narrowness in existing methods and fragmented analytical pipelines. We present BMMDetect, a multimodal deep learning framework that integrates journal metadata (SJR, institutional data), semantic embeddings (PubMedBERT), and GPT-4o-mined textual attributes (methodological statistics, data anomalies) for holistic manuscript evaluation. Key innovations include: (1) multimodal fusion of domain-specific features to reduce detection bias; (2) quantitative evaluation of feature importance, identifying journal authority metrics (e.g., SJR-index) and textual anomalies (e.g., statistical outliers) as dominant predictors; and (3) the BioMCD dataset, a large-scale benchmark with 13,160 retracted articles and 53,411 controls. BMMDetect achieves 74.33% AUC, outperforming single-modality baselines by 8.6%, and demonstrates transferability across biomedical subfields. This work advances scalable, interpretable tools for safeguarding research integrity.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Rethinking Graph Contrastive Learning through Relative Similarity Preservation
Authors:
Zhiyuan Ning,
Pengfei Wang,
Ziyue Qiao,
Pengyang Wang,
Yuanchun Zhou
Abstract:
Graph contrastive learning (GCL) has achieved remarkable success by following the computer vision paradigm of preserving absolute similarity between augmented views. However, this approach faces fundamental challenges in graphs due to their discrete, non-Euclidean nature -- view generation often breaks semantic validity and similarity verification becomes unreliable. Through analyzing 11 real-worl…
▽ More
Graph contrastive learning (GCL) has achieved remarkable success by following the computer vision paradigm of preserving absolute similarity between augmented views. However, this approach faces fundamental challenges in graphs due to their discrete, non-Euclidean nature -- view generation often breaks semantic validity and similarity verification becomes unreliable. Through analyzing 11 real-world graphs, we discover a universal pattern transcending the homophily-heterophily dichotomy: label consistency systematically diminishes as structural distance increases, manifesting as smooth decay in homophily graphs and oscillatory decay in heterophily graphs. We establish theoretical guarantees for this pattern through random walk theory, proving label distribution convergence and characterizing the mechanisms behind different decay behaviors. This discovery reveals that graphs naturally encode relative similarity patterns, where structurally closer nodes exhibit collectively stronger semantic relationships. Leveraging this insight, we propose RELGCL, a novel GCL framework with complementary pairwise and listwise implementations that preserve these inherent patterns through collective similarity objectives. Extensive experiments demonstrate that our method consistently outperforms 20 existing approaches across both homophily and heterophily graphs, validating the effectiveness of leveraging natural relative similarity over artificial absolute similarity.
△ Less
Submitted 12 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations
Authors:
Anthony Liang,
Pavel Czempin,
Matthew Hong,
Yutai Zhou,
Erdem Biyik,
Stephen Tu
Abstract:
Learning robot policies using imitation learning requires collecting large amounts of costly action-labeled expert demonstrations, which fundamentally limits the scale of training data. A promising approach to address this bottleneck is to harness the abundance of unlabeled observations-e.g., from video demonstrations-to learn latent action labels in an unsupervised way. However, we find that exis…
▽ More
Learning robot policies using imitation learning requires collecting large amounts of costly action-labeled expert demonstrations, which fundamentally limits the scale of training data. A promising approach to address this bottleneck is to harness the abundance of unlabeled observations-e.g., from video demonstrations-to learn latent action labels in an unsupervised way. However, we find that existing methods struggle when applied to complex robot tasks requiring fine-grained motions. We design continuous latent action models (CLAM) which incorporate two key ingredients we find necessary for learning to solve complex continuous control tasks from unlabeled observation data: (a) using continuous latent action labels instead of discrete representations, and (b) jointly training an action decoder to ensure that the latent action space can be easily grounded to real actions with relatively few labeled examples. Importantly, the labeled examples can be collected from non-optimal play data, enabling CLAM to learn performant policies without access to any action-labeled expert data. We demonstrate on continuous control benchmarks in DMControl (locomotion) and MetaWorld (manipulation), as well as on a real WidowX robot arm that CLAM significantly outperforms prior state-of-the-art methods, remarkably with a 2-3x improvement in task success rate compared to the best baseline. Videos and code can be found at clamrobot.github.io.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
CPP-DIP: Multi-objective Coverage Path Planning for MAVs in Dispersed and Irregular Plantations
Authors:
Weijie Kuang,
Hann Woei Ho,
Ye Zhou
Abstract:
Coverage Path Planning (CPP) is vital in precision agriculture to improve efficiency and resource utilization. In irregular and dispersed plantations, traditional grid-based CPP often causes redundant coverage over non-vegetated areas, leading to waste and pollution. To overcome these limitations, we propose CPP-DIP, a multi-objective CPP framework designed for Micro Air Vehicles (MAVs). The frame…
▽ More
Coverage Path Planning (CPP) is vital in precision agriculture to improve efficiency and resource utilization. In irregular and dispersed plantations, traditional grid-based CPP often causes redundant coverage over non-vegetated areas, leading to waste and pollution. To overcome these limitations, we propose CPP-DIP, a multi-objective CPP framework designed for Micro Air Vehicles (MAVs). The framework transforms the CPP task into a Traveling Salesman Problem (TSP) and optimizes flight paths by minimizing travel distance, turning angles, and intersection counts. Unlike conventional approaches, our method does not rely on GPS-based environmental modeling. Instead, it uses aerial imagery and a Histogram of Oriented Gradients (HOG)-based approach to detect trees and extract image coordinates. A density-aware waypoint strategy is applied: Kernel Density Estimation (KDE) is used to reduce redundant waypoints in dense regions, while a greedy algorithm ensures complete coverage in sparse areas. To verify the generality of the framework, we solve the resulting TSP using three different methods: Greedy Heuristic Insertion (GHI), Ant Colony Optimization (ACO), and Monte Carlo Reinforcement Learning (MCRL). Then an object-based optimization is applied to further refine the resulting path. Additionally, CPP-DIP integrates ForaNav, our insect-inspired navigation method, for accurate tree localization and tracking. The experimental results show that MCRL offers a balanced solution, reducing the travel distance by 16.9 % compared to ACO while maintaining a similar performance to GHI. It also improves path smoothness by reducing turning angles by 28.3 % and 59.9 % relative to ACO and GHI, respectively, and effectively eliminates intersections. These results confirm the robustness and effectiveness of CPP-DIP in different TSP solvers.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
FedRE: Robust and Effective Federated Learning with Privacy Preference
Authors:
Tianzhe Xiao,
Yichen Li,
Yu Zhou,
Yining Qi,
Yi Liu,
Wei Wang,
Haozhao Wang,
Yi Wang,
Ruixuan Li
Abstract:
Despite Federated Learning (FL) employing gradient aggregation at the server for distributed training to prevent the privacy leakage of raw data, private information can still be divulged through the analysis of uploaded gradients from clients. Substantial efforts have been made to integrate local differential privacy (LDP) into the system to achieve a strict privacy guarantee. However, existing m…
▽ More
Despite Federated Learning (FL) employing gradient aggregation at the server for distributed training to prevent the privacy leakage of raw data, private information can still be divulged through the analysis of uploaded gradients from clients. Substantial efforts have been made to integrate local differential privacy (LDP) into the system to achieve a strict privacy guarantee. However, existing methods fail to take practical issues into account by merely perturbing each sample with the same mechanism while each client may have their own privacy preferences on privacy-sensitive information (PSI), which is not uniformly distributed across the raw data. In such a case, excessive privacy protection from private-insensitive information can additionally introduce unnecessary noise, which may degrade the model performance. In this work, we study the PSI within data and develop FedRE, that can simultaneously achieve robustness and effectiveness benefits with LDP protection. More specifically, we first define PSI with regard to the privacy preferences of each client. Then, we optimize the LDP by allocating less privacy budget to gradients with higher PSI in a layer-wise manner, thus providing a stricter privacy guarantee for PSI. Furthermore, to mitigate the performance degradation caused by LDP, we design a parameter aggregation mechanism based on the distribution of the perturbed information. We conducted experiments with text tamper detection on T-SROIE and DocTamper datasets, and FedRE achieves competitive performance compared to state-of-the-art methods.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Learning from Loss Landscape: Generalizable Mixed-Precision Quantization via Adaptive Sharpness-Aware Gradient Aligning
Authors:
Lianbo Ma,
Jianlun Ma,
Yuee Zhou,
Guoyang Xie,
Qiang He,
Zhichao Lu
Abstract:
Mixed Precision Quantization (MPQ) has become an essential technique for optimizing neural network by determining the optimal bitwidth per layer. Existing MPQ methods, however, face a major hurdle: they require a computationally expensive search for quantization policies on large-scale datasets. To resolve this issue, we introduce a novel approach that first searches for quantization policies on s…
▽ More
Mixed Precision Quantization (MPQ) has become an essential technique for optimizing neural network by determining the optimal bitwidth per layer. Existing MPQ methods, however, face a major hurdle: they require a computationally expensive search for quantization policies on large-scale datasets. To resolve this issue, we introduce a novel approach that first searches for quantization policies on small datasets and then generalizes them to large-scale datasets. This approach simplifies the process, eliminating the need for large-scale quantization fine-tuning and only necessitating model weight adjustment. Our method is characterized by three key techniques: sharpness-aware minimization for enhanced quantization generalization, implicit gradient direction alignment to handle gradient conflicts among different optimization objectives, and an adaptive perturbation radius to accelerate optimization. Both theoretical analysis and experimental results validate our approach. Using the CIFAR10 dataset (just 0.5\% the size of ImageNet training data) for MPQ policy search, we achieved equivalent accuracy on ImageNet with a significantly lower computational cost, while improving efficiency by up to 150% over the baselines.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World
Authors:
Bangyan Liao,
Zhenjun Zhao,
Haoang Li,
Yi Zhou,
Yingping Zeng,
Hao Li,
Peidong Liu
Abstract:
Determining the vanishing points (VPs) in a Manhattan world, as a fundamental task in many 3D vision applications, consists of jointly inferring the line-VP association and locating each VP. Existing methods are, however, either sub-optimal solvers or pursuing global optimality at a significant cost of computing time. In contrast to prior works, we introduce convex relaxation techniques to solve t…
▽ More
Determining the vanishing points (VPs) in a Manhattan world, as a fundamental task in many 3D vision applications, consists of jointly inferring the line-VP association and locating each VP. Existing methods are, however, either sub-optimal solvers or pursuing global optimality at a significant cost of computing time. In contrast to prior works, we introduce convex relaxation techniques to solve this task for the first time. Specifically, we employ a ``soft'' association scheme, realized via a truncated multi-selection error, that allows for joint estimation of VPs' locations and line-VP associations. This approach leads to a primal problem that can be reformulated into a quadratically constrained quadratic programming (QCQP) problem, which is then relaxed into a convex semidefinite programming (SDP) problem. To solve this SDP problem efficiently, we present a globally optimal outlier-robust iterative solver (called \textbf{GlobustVP}), which independently searches for one VP and its associated lines in each iteration, treating other lines as outliers. After each independent update of all VPs, the mutual orthogonality between the three VPs in a Manhattan world is reinforced via local refinement. Extensive experiments on both synthetic and real-world data demonstrate that \textbf{GlobustVP} achieves a favorable balance between efficiency, robustness, and global optimality compared to previous works. The code is publicly available at https://github.com/WU-CVGL/GlobustVP.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
PrimitiveAnything: Human-Crafted 3D Primitive Assembly Generation with Auto-Regressive Transformer
Authors:
Jingwen Ye,
Yuze He,
Yanning Zhou,
Yiqin Zhu,
Kaiwen Xiao,
Yong-Jin Liu,
Wei Yang,
Xiao Han
Abstract:
Shape primitive abstraction, which decomposes complex 3D shapes into simple geometric elements, plays a crucial role in human visual cognition and has broad applications in computer vision and graphics. While recent advances in 3D content generation have shown remarkable progress, existing primitive abstraction methods either rely on geometric optimization with limited semantic understanding or le…
▽ More
Shape primitive abstraction, which decomposes complex 3D shapes into simple geometric elements, plays a crucial role in human visual cognition and has broad applications in computer vision and graphics. While recent advances in 3D content generation have shown remarkable progress, existing primitive abstraction methods either rely on geometric optimization with limited semantic understanding or learn from small-scale, category-specific datasets, struggling to generalize across diverse shape categories. We present PrimitiveAnything, a novel framework that reformulates shape primitive abstraction as a primitive assembly generation task. PrimitiveAnything includes a shape-conditioned primitive transformer for auto-regressive generation and an ambiguity-free parameterization scheme to represent multiple types of primitives in a unified manner. The proposed framework directly learns the process of primitive assembly from large-scale human-crafted abstractions, enabling it to capture how humans decompose complex shapes into primitive elements. Through extensive experiments, we demonstrate that PrimitiveAnything can generate high-quality primitive assemblies that better align with human perception while maintaining geometric fidelity across diverse shape categories. It benefits various 3D applications and shows potential for enabling primitive-based user-generated content (UGC) in games. Project page: https://primitiveanything.github.io
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
On Path to Multimodal Generalist: General-Level and General-Bench
Authors:
Hao Fei,
Yuan Zhou,
Juncheng Li,
Xiangtai Li,
Qingshan Xu,
Bobo Li,
Shengqiong Wu,
Yaoting Wang,
Junbao Zhou,
Jiahao Meng,
Qingyu Shi,
Zhiyuan Zhou,
Liangtao Shi,
Minghe Gao,
Daoan Zhang,
Zhiqi Ge,
Weiming Wu,
Siliang Tang,
Kaihang Pan,
Yaobo Ye,
Haobo Yuan,
Tao Zhang,
Tianjie Ju,
Zixiang Meng,
Shilin Xu
, et al. (7 additional authors not shown)
Abstract:
The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expande…
▽ More
The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting limited modalities to arbitrary ones. While many benchmarks exist to assess MLLMs, a critical question arises: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: https://generalist.top/
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
Authors:
Teng Hu,
Zhentao Yu,
Zhengguang Zhou,
Sen Liang,
Yuan Zhou,
Qin Lin,
Qinglin Lu
Abstract:
Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. B…
▽ More
Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at https://hunyuancustom.github.io.
△ Less
Submitted 8 May, 2025; v1 submitted 7 May, 2025;
originally announced May 2025.
-
Efficient Fine-Tuning of Quantized Models via Adaptive Rank and Bitwidth
Authors:
Changhai Zhou,
Yuhua Zhou,
Qian Qiao,
Weizhong Zhang,
Cheng Jin
Abstract:
QLoRA effectively combines low-bit quantization and LoRA to achieve memory-friendly fine-tuning for large language models (LLM). Recently, methods based on SVD for continuous update iterations to initialize LoRA matrices to accommodate quantization errors have generally failed to consistently improve performance. Dynamic mixed precision is a natural idea for continuously improving the fine-tuning…
▽ More
QLoRA effectively combines low-bit quantization and LoRA to achieve memory-friendly fine-tuning for large language models (LLM). Recently, methods based on SVD for continuous update iterations to initialize LoRA matrices to accommodate quantization errors have generally failed to consistently improve performance. Dynamic mixed precision is a natural idea for continuously improving the fine-tuning performance of quantized models, but previous methods often optimize low-rank subspaces or quantization components separately, without considering their synergy. To address this, we propose \textbf{QR-Adaptor}, a unified, gradient-free strategy that uses partial calibration data to jointly search the quantization components and the rank of low-rank spaces for each layer, thereby continuously improving model performance. QR-Adaptor does not minimize quantization error but treats precision and rank allocation as a discrete optimization problem guided by actual downstream performance and memory usage. Compared to state-of-the-art (SOTA) quantized LoRA fine-tuning methods, our approach achieves a 4.89\% accuracy improvement on GSM8K, and in some cases even outperforms the 16-bit fine-tuned model while maintaining the memory footprint of the 4-bit setting.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Moral Testing of Autonomous Driving Systems
Authors:
Wenbing Tang,
Mingfei Cheng,
Yuan Zhou,
Yang Liu
Abstract:
Autonomous Driving System (ADS) testing plays a crucial role in their development, with the current focus primarily on functional and safety testing. However, evaluating the non-functional morality of ADSs, particularly their decision-making capabilities in unavoidable collision scenarios, is equally important to ensure the systems' trustworthiness and public acceptance. Unfortunately, testing ADS…
▽ More
Autonomous Driving System (ADS) testing plays a crucial role in their development, with the current focus primarily on functional and safety testing. However, evaluating the non-functional morality of ADSs, particularly their decision-making capabilities in unavoidable collision scenarios, is equally important to ensure the systems' trustworthiness and public acceptance. Unfortunately, testing ADS morality is nearly impossible due to the absence of universal moral principles. To address this challenge, this paper first extracts a set of moral meta-principles derived from existing moral experiments and well-established social science theories, aiming to capture widely recognized and common-sense moral values for ADSs. These meta-principles are then formalized as quantitative moral metamorphic relations, which act as the test oracle. Furthermore, we propose a metamorphic testing framework to systematically identify potential moral issues. Finally, we illustrate the implementation of the framework and present typical violation cases using the VIRES VTD simulator and its built-in ADS.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Modality-Guided Dynamic Graph Fusion and Temporal Diffusion for Self-Supervised RGB-T Tracking
Authors:
Shenglan Li,
Rui Yao,
Yong Zhou,
Hancheng Zhu,
Kunyang Sun,
Bing Liu,
Zhiwen Shao,
Jiaqi Zhao
Abstract:
To reduce the reliance on large-scale annotations, self-supervised RGB-T tracking approaches have garnered significant attention. However, the omission of the object region by erroneous pseudo-label or the introduction of background noise affects the efficiency of modality fusion, while pseudo-label noise triggered by similar object noise can further affect the tracking performance. In this paper,…
▽ More
To reduce the reliance on large-scale annotations, self-supervised RGB-T tracking approaches have garnered significant attention. However, the omission of the object region by erroneous pseudo-label or the introduction of background noise affects the efficiency of modality fusion, while pseudo-label noise triggered by similar object noise can further affect the tracking performance. In this paper, we propose GDSTrack, a novel approach that introduces dynamic graph fusion and temporal diffusion to address the above challenges in self-supervised RGB-T tracking. GDSTrack dynamically fuses the modalities of neighboring frames, treats them as distractor noise, and leverages the denoising capability of a generative model. Specifically, by constructing an adjacency matrix via an Adjacency Matrix Generator (AMG), the proposed Modality-guided Dynamic Graph Fusion (MDGF) module uses a dynamic adjacency matrix to guide graph attention, focusing on and fusing the object's coherent regions. Temporal Graph-Informed Diffusion (TGID) models MDGF features from neighboring frames as interference, and thus improving robustness against similar-object noise. Extensive experiments conducted on four public RGB-T tracking datasets demonstrate that GDSTrack outperforms the existing state-of-the-art methods. The source code is available at https://github.com/LiShenglana/GDSTrack.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Behavioral Sensing and Intervention Paradigm: A Review of Closed-Loop Approaches for Ingestion Health
Authors:
Jun Fang,
Yanuo Zhou,
Ka I Chan,
Jiajin Li,
Zeyi Sun,
Zhengnan Li,
Zicong Fu,
Hongjing Piao,
Haodong Xu,
Yuanchun Shi,
Yuntao Wang
Abstract:
Ingestive behavior plays a critical role in health, yet many existing interventions remain limited to static guidance or manual self-tracking. With the increasing integration of sensors and perceptual computing, recent systems have begun to support closed-loop interventions that dynamically sense user behavior and provide feedback during or around ingestion episodes. In this survey, we review 136…
▽ More
Ingestive behavior plays a critical role in health, yet many existing interventions remain limited to static guidance or manual self-tracking. With the increasing integration of sensors and perceptual computing, recent systems have begun to support closed-loop interventions that dynamically sense user behavior and provide feedback during or around ingestion episodes. In this survey, we review 136 studies that leverage sensor-enabled or interaction-mediated approaches to influence eating behavior. We propose a behavioral closed-loop paradigm comprising three core components: target behaviors, sensing modalities, and feedback strategies. A taxonomy of sensing and intervention modalities is presented, organized along human- and environment-based dimensions. Our analysis also examines evaluation methods and design trends across different modality-behavior pairings. This review reveals prevailing patterns and critical gaps, offering design insights for future adaptive and context-aware ingestion health interventions.
△ Less
Submitted 14 May, 2025; v1 submitted 6 May, 2025;
originally announced May 2025.
-
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
Authors:
Qingkai Fang,
Yan Zhou,
Shoutao Guo,
Shaolei Zhang,
Yang Feng
Abstract:
Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achievin…
▽ More
Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
Quaternion Multi-focus Color Image Fusion
Authors:
Weihua Yang,
Yicong Zhou
Abstract:
Multi-focus color image fusion refers to integrating multiple partially focused color images to create a single all-in-focus color image. However, existing methods struggle with complex real-world scenarios due to limitations in handling color information and intricate textures. To address these challenges, this paper proposes a quaternion multi-focus color image fusion framework to perform high-q…
▽ More
Multi-focus color image fusion refers to integrating multiple partially focused color images to create a single all-in-focus color image. However, existing methods struggle with complex real-world scenarios due to limitations in handling color information and intricate textures. To address these challenges, this paper proposes a quaternion multi-focus color image fusion framework to perform high-quality color image fusion completely in the quaternion domain. This framework introduces 1) a quaternion sparse decomposition model to jointly learn fine-scale image details and structure information of color images in an iterative fashion for high-precision focus detection, 2) a quaternion base-detail fusion strategy to individually fuse base-scale and detail-scale results across multiple color images for preserving structure and detail information, and 3) a quaternion structural similarity refinement strategy to adaptively select optimal patches from initial fusion results and obtain the final fused result for preserving fine details and ensuring spatially consistent outputs. Extensive experiments demonstrate that the proposed framework outperforms state-of-the-art methods.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
Quaternion Infrared Visible Image Fusion
Authors:
Weihua Yang,
Yicong Zhou
Abstract:
Visible images provide rich details and color information only under well-lighted conditions while infrared images effectively highlight thermal targets under challenging conditions such as low visibility and adverse weather. Infrared-visible image fusion aims to integrate complementary information from infrared and visible images to generate a high-quality fused image. Existing methods exhibit cr…
▽ More
Visible images provide rich details and color information only under well-lighted conditions while infrared images effectively highlight thermal targets under challenging conditions such as low visibility and adverse weather. Infrared-visible image fusion aims to integrate complementary information from infrared and visible images to generate a high-quality fused image. Existing methods exhibit critical limitations such as neglecting color structure information in visible images and performance degradation when processing low-quality color-visible inputs. To address these issues, we propose a quaternion infrared-visible image fusion (QIVIF) framework to generate high-quality fused images completely in the quaternion domain. QIVIF proposes a quaternion low-visibility feature learning model to adaptively extract salient thermal targets and fine-grained texture details from input infrared and visible images respectively under diverse degraded conditions. QIVIF then develops a quaternion adaptive unsharp masking method to adaptively improve high-frequency feature enhancement with balanced illumination. QIVIF further proposes a quaternion hierarchical Bayesian fusion model to integrate infrared saliency and enhanced visible details to obtain high-quality fused images. Extensive experiments across diverse datasets demonstrate that our QIVIF surpasses state-of-the-art methods under challenging low-visibility conditions.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
TeDA: Boosting Vision-Lanuage Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment
Authors:
Zhichuan Wang,
Yang Zhou,
Jinhai Xiang,
Yulong Wang,
Xinwei He
Abstract:
Learning discriminative 3D representations that generalize well to unknown testing categories is an emerging requirement for many real-world 3D applications. Existing well-established methods often struggle to attain this goal due to insufficient 3D training data from broader concepts. Meanwhile, pre-trained large vision-language models (e.g., CLIP) have shown remarkable zero-shot generalization c…
▽ More
Learning discriminative 3D representations that generalize well to unknown testing categories is an emerging requirement for many real-world 3D applications. Existing well-established methods often struggle to attain this goal due to insufficient 3D training data from broader concepts. Meanwhile, pre-trained large vision-language models (e.g., CLIP) have shown remarkable zero-shot generalization capabilities. Yet, they are limited in extracting suitable 3D representations due to substantial gaps between their 2D training and 3D testing distributions. To address these challenges, we propose Testing-time Distribution Alignment (TeDA), a novel framework that adapts a pretrained 2D vision-language model CLIP for unknown 3D object retrieval at test time. To our knowledge, it is the first work that studies the test-time adaptation of a vision-language model for 3D feature learning. TeDA projects 3D objects into multi-view images, extracts features using CLIP, and refines 3D query embeddings with an iterative optimization strategy by confident query-target sample pairs in a self-boosting manner. Additionally, TeDA integrates textual descriptions generated by a multimodal language model (InternVL) to enhance 3D object understanding, leveraging CLIP's aligned feature space to fuse visual and textual cues. Extensive experiments on four open-set 3D object retrieval benchmarks demonstrate that TeDA greatly outperforms state-of-the-art methods, even those requiring extensive training. We also experimented with depth maps on Objaverse-LVIS, further validating its effectiveness. Code is available at https://github.com/wangzhichuan123/TeDA.
△ Less
Submitted 4 May, 2025;
originally announced May 2025.
-
ForgeEDA: A Comprehensive Multimodal Dataset for Advancing EDA
Authors:
Zhengyuan Shi,
Zeju Li,
Chengyu Ma,
Yunhao Zhou,
Ziyang Zheng,
Jiawei Liu,
Hongyang Pan,
Lingfeng Zhou,
Kezhi Li,
Jiaying Zhu,
Lingwei Yan,
Zhiqiang He,
Chenhao Xue,
Wentao Jiang,
Fan Yang,
Guangyu Sun,
Xiaoyan Yang,
Gang Chen,
Chuan Shi,
Zhufei Chu,
Jun Yang,
Qiang Xu
Abstract:
We introduce ForgeEDA, an open-source comprehensive circuit dataset across various categories. ForgeEDA includes diverse circuit representations such as Register Transfer Level (RTL) code, Post-mapping (PM) netlists, And-Inverter Graphs (AIGs), and placed netlists, enabling comprehensive analysis and development. We demonstrate ForgeEDA's utility by benchmarking state-of-the-art EDA algorithms on…
▽ More
We introduce ForgeEDA, an open-source comprehensive circuit dataset across various categories. ForgeEDA includes diverse circuit representations such as Register Transfer Level (RTL) code, Post-mapping (PM) netlists, And-Inverter Graphs (AIGs), and placed netlists, enabling comprehensive analysis and development. We demonstrate ForgeEDA's utility by benchmarking state-of-the-art EDA algorithms on critical tasks such as Power, Performance, and Area (PPA) optimization, highlighting its ability to expose performance gaps and drive advancements. Additionally, ForgeEDA's scale and diversity facilitate the training of AI models for EDA tasks, demonstrating its potential to improve model performance and generalization. By addressing limitations in existing datasets, ForgeEDA aims to catalyze breakthroughs in modern IC design and support the next generation of innovations in EDA.
△ Less
Submitted 4 May, 2025;
originally announced May 2025.
-
Phasing Through the Flames: Rapid Motion Planning with the AGHF PDE for Arbitrary Objective Functions and Constraints
Authors:
Challen Enninful Adu,
César E. Ramos Chuquiure,
Yutong Zhou,
Pearl Lin,
Ruikai Yang,
Bohao Zhang,
Shubham Singh,
Ram Vasudevan
Abstract:
The generation of optimal trajectories for high-dimensional robotic systems under constraints remains computationally challenging due to the need to simultaneously satisfy dynamic feasibility, input limits, and task-specific objectives while searching over high-dimensional spaces. Recent approaches using the Affine Geometric Heat Flow (AGHF) Partial Differential Equation (PDE) have demonstrated pr…
▽ More
The generation of optimal trajectories for high-dimensional robotic systems under constraints remains computationally challenging due to the need to simultaneously satisfy dynamic feasibility, input limits, and task-specific objectives while searching over high-dimensional spaces. Recent approaches using the Affine Geometric Heat Flow (AGHF) Partial Differential Equation (PDE) have demonstrated promising results, generating dynamically feasible trajectories for complex systems like the Digit V3 humanoid within seconds. These methods efficiently solve trajectory optimization problems over a two-dimensional domain by evolving an initial trajectory to minimize control effort. However, these AGHF approaches are limited to a single type of optimal control problem (i.e., minimizing the integral of squared control norms) and typically require initial guesses that satisfy constraints to ensure satisfactory convergence. These limitations restrict the potential utility of the AGHF PDE especially when trying to synthesize trajectories for robotic systems. This paper generalizes the AGHF formulation to accommodate arbitrary cost functions, significantly expanding the classes of trajectories that can be generated. This work also introduces a Phase1 - Phase 2 Algorithm that enables the use of constraint-violating initial guesses while guaranteeing satisfactory convergence. The effectiveness of the proposed method is demonstrated through comparative evaluations against state-of-the-art techniques across various dynamical systems and challenging trajectory generation problems. Project Page: https://roahmlab.github.io/BLAZE/
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Machine Learning for Cyber-Attack Identification from Traffic Flows
Authors:
Yujing Zhou,
Marc L. Jacquet,
Robel Dawit,
Skyler Fabre,
Dev Sarawat,
Faheem Khan,
Madison Newell,
Yongxin Liu,
Dahai Liu,
Hongyun Chen,
Jian Wang,
Huihui Wang
Abstract:
This paper presents our simulation of cyber-attacks and detection strategies on the traffic control system in Daytona Beach, FL. using Raspberry Pi virtual machines and the OPNSense firewall, along with traffic dynamics from SUMO and exploitation via the Metasploit framework. We try to answer the research questions: are we able to identify cyber attacks by only analyzing traffic flow patterns. In…
▽ More
This paper presents our simulation of cyber-attacks and detection strategies on the traffic control system in Daytona Beach, FL. using Raspberry Pi virtual machines and the OPNSense firewall, along with traffic dynamics from SUMO and exploitation via the Metasploit framework. We try to answer the research questions: are we able to identify cyber attacks by only analyzing traffic flow patterns. In this research, the cyber attacks are focused particularly when lights are randomly turned all green or red at busy intersections by adversarial attackers. Despite challenges stemming from imbalanced data and overlapping traffic patterns, our best model shows 85\% accuracy when detecting intrusions purely using traffic flow statistics. Key indicators for successful detection included occupancy, jam length, and halting durations.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Explainable Machine Learning for Cyberattack Identification from Traffic Flows
Authors:
Yujing Zhou,
Marc L. Jacquet,
Robel Dawit,
Skyler Fabre,
Dev Sarawat,
Faheem Khan,
Madison Newell,
Yongxin Liu,
Dahai Liu,
Hongyun Chen,
Jian Wang,
Huihui Wang
Abstract:
The increasing automation of traffic management systems has made them prime targets for cyberattacks, disrupting urban mobility and public safety. Traditional network-layer defenses are often inaccessible to transportation agencies, necessitating a machine learning-based approach that relies solely on traffic flow data. In this study, we simulate cyberattacks in a semi-realistic environment, using…
▽ More
The increasing automation of traffic management systems has made them prime targets for cyberattacks, disrupting urban mobility and public safety. Traditional network-layer defenses are often inaccessible to transportation agencies, necessitating a machine learning-based approach that relies solely on traffic flow data. In this study, we simulate cyberattacks in a semi-realistic environment, using a virtualized traffic network to analyze disruption patterns. We develop a deep learning-based anomaly detection system, demonstrating that Longest Stop Duration and Total Jam Distance are key indicators of compromised signals. To enhance interpretability, we apply Explainable AI (XAI) techniques, identifying critical decision factors and diagnosing misclassification errors. Our analysis reveals two primary challenges: transitional data inconsistencies, where mislabeled recovery-phase traffic misleads the model, and model limitations, where stealth attacks in low-traffic conditions evade detection. This work enhances AI-driven traffic security, improving both detection accuracy and trustworthiness in smart transportation systems.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
A Minimax-MDP Framework with Future-imposed Conditions for Learning-augmented Problems
Authors:
Xin Chen,
Yuze Chen,
Yuan Zhou
Abstract:
We study a class of sequential decision-making problems with augmented predictions, potentially provided by a machine learning algorithm. In this setting, the decision-maker receives prediction intervals for unknown parameters that become progressively refined over time, and seeks decisions that are competitive with the hindsight optimal under all possible realizations of both parameters and predi…
▽ More
We study a class of sequential decision-making problems with augmented predictions, potentially provided by a machine learning algorithm. In this setting, the decision-maker receives prediction intervals for unknown parameters that become progressively refined over time, and seeks decisions that are competitive with the hindsight optimal under all possible realizations of both parameters and predictions. We propose a minimax Markov Decision Process (minimax-MDP) framework, where the system state consists of an adversarially evolving environment state and an internal state controlled by the decision-maker. We introduce a set of future-imposed conditions that characterize the feasibility of minimax-MDPs and enable the design of efficient, often closed-form, robustly competitive policies. We illustrate the framework through three applications: multi-period inventory ordering with refining demand predictions, resource allocation with uncertain utility functions, and a multi-phase extension of the minimax-MDP applied to the inventory problem with time-varying ordering costs. Our results provide a tractable and versatile approach to robust online decision-making under predictive uncertainty.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
-
A Survey on Large Language Model based Human-Agent Systems
Authors:
Henry Peng Zou,
Wei-Chieh Huang,
Yaozu Wu,
Yankai Chen,
Chunyu Miao,
Hoang Nguyen,
Yue Zhou,
Weizhi Zhang,
Liancheng Fang,
Langzhou He,
Yangning Li,
Yuwei Cao,
Dongyuan Li,
Renhe Jiang,
Philip S. Yu
Abstract:
Recent advances in large language models (LLMs) have sparked growing interest in building fully autonomous agents. However, fully autonomous LLM-based agents still face significant challenges, including limited reliability due to hallucinations, difficulty in handling complex tasks, and substantial safety and ethical risks, all of which limit their feasibility and trustworthiness in real-world app…
▽ More
Recent advances in large language models (LLMs) have sparked growing interest in building fully autonomous agents. However, fully autonomous LLM-based agents still face significant challenges, including limited reliability due to hallucinations, difficulty in handling complex tasks, and substantial safety and ethical risks, all of which limit their feasibility and trustworthiness in real-world applications. To overcome these limitations, LLM-based human-agent systems (LLM-HAS) incorporate human-provided information, feedback, or control into the agent system to enhance system performance, reliability and safety. This paper provides the first comprehensive and structured survey of LLM-HAS. It clarifies fundamental concepts, systematically presents core components shaping these systems, including environment & profiling, human feedback, interaction types, orchestration and communication, explores emerging applications, and discusses unique challenges and opportunities. By consolidating current knowledge and offering a structured overview, we aim to foster further research and innovation in this rapidly evolving interdisciplinary field. Paper lists and resources are available at https://github.com/HenryPengZou/Awesome-LLM-Based-Human-Agent-System-Papers.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
-
From GNNs to Trees: Multi-Granular Interpretability for Graph Neural Networks
Authors:
Jie Yang,
Yuwen Wang,
Kaixuan Chen,
Tongya Zheng,
Yihe Zhou,
Zhenbang Xiao,
Ji Cao,
Mingli Song,
Shunyu Liu
Abstract:
Interpretable Graph Neural Networks (GNNs) aim to reveal the underlying reasoning behind model predictions, attributing their decisions to specific subgraphs that are informative. However, existing subgraph-based interpretable methods suffer from an overemphasis on local structure, potentially overlooking long-range dependencies within the entire graphs. Although recent efforts that rely on graph…
▽ More
Interpretable Graph Neural Networks (GNNs) aim to reveal the underlying reasoning behind model predictions, attributing their decisions to specific subgraphs that are informative. However, existing subgraph-based interpretable methods suffer from an overemphasis on local structure, potentially overlooking long-range dependencies within the entire graphs. Although recent efforts that rely on graph coarsening have proven beneficial for global interpretability, they inevitably reduce the graphs to a fixed granularity. Such an inflexible way can only capture graph connectivity at a specific level, whereas real-world graph tasks often exhibit relationships at varying granularities (e.g., relevant interactions in proteins span from functional groups, to amino acids, and up to protein domains). In this paper, we introduce a novel Tree-like Interpretable Framework (TIF) for graph classification, where plain GNNs are transformed into hierarchical trees, with each level featuring coarsened graphs of different granularity as tree nodes. Specifically, TIF iteratively adopts a graph coarsening module to compress original graphs (i.e., root nodes of trees) into increasingly coarser ones (i.e., child nodes of trees), while preserving diversity among tree nodes within different branches through a dedicated graph perturbation module. Finally, we propose an adaptive routing module to identify the most informative root-to-leaf paths, providing not only the final prediction but also the multi-granular interpretability for the decision-making process. Extensive experiments on the graph classification benchmarks with both synthetic and real-world datasets demonstrate the superiority of TIF in interpretability, while also delivering a competitive prediction performance akin to the state-of-the-art counterparts.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.