-
SPATIA: Multimodal Model for Prediction and Generation of Spatial Cell Phenotypes
Authors:
Zhenglun Kong,
Mufan Qiu,
John Boesen,
Xiang Lin,
Sukwon Yun,
Tianlong Chen,
Manolis Kellis,
Marinka Zitnik
Abstract:
Understanding how cellular morphology, gene expression, and spatial organization jointly shape tissue function is a central challenge in biology. Image-based spatial transcriptomics technologies now provide high-resolution measurements of cell images and gene expression profiles, but machine learning methods typically analyze these modalities in isolation or at limited resolution. We address the p…
▽ More
Understanding how cellular morphology, gene expression, and spatial organization jointly shape tissue function is a central challenge in biology. Image-based spatial transcriptomics technologies now provide high-resolution measurements of cell images and gene expression profiles, but machine learning methods typically analyze these modalities in isolation or at limited resolution. We address the problem of learning unified, spatially aware representations that integrate cell morphology, gene expression, and spatial context across biological scales. This requires models that can operate at single-cell resolution, reason across spatial neighborhoods, and generalize to whole-slide tissue organization. Here, we introduce SPATIA, a multi-scale generative and predictive model for spatial transcriptomics. SPATIA learns cell-level embeddings by fusing image-derived morphological tokens and transcriptomic vector tokens using cross-attention and then aggregates them at niche and tissue levels using transformer modules to capture spatial dependencies. SPATIA incorporates token merging in its generative diffusion decoder to synthesize high-resolution cell images conditioned on gene expression. We assembled a multi-scale dataset consisting of 17 million cell-gene pairs, 1 million niche-gene pairs, and 10,000 tissue-gene pairs across 49 donors, 17 tissue types, and 12 disease states. We benchmark SPATIA against 13 existing models across 12 individual tasks, which span several categories including cell annotation, cell clustering, gene imputation, cross-modal prediction, and image generation. SPATIA achieves improved performance over all baselines and generates realistic cell morphologies that reflect transcriptomic perturbations.
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
DAM-VSR: Disentanglement of Appearance and Motion for Video Super-Resolution
Authors:
Zhe Kong,
Le Li,
Yong Zhang,
Feng Gao,
Shaoshu Yang,
Tao Wang,
Kaihao Zhang,
Zhuoliang Kang,
Xiaoming Wei,
Guanying Chen,
Wenhan Luo
Abstract:
Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. H…
▽ More
Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. However, due to the intrinsic image-animation characteristics of SVD, it is challenging to generate fine details using only low-quality videos. To tackle this problem, we propose DAM-VSR, an appearance and motion disentanglement framework for VSR. This framework disentangles VSR into appearance enhancement and motion control problems. Specifically, appearance enhancement is achieved through reference image super-resolution, while motion control is achieved through video ControlNet. This disentanglement fully leverages the generative prior of video diffusion models and the detail generation capabilities of image super-resolution models. Furthermore, equipped with the proposed motion-aligned bidirectional sampling strategy, DAM-VSR can conduct VSR on longer input videos. DAM-VSR achieves state-of-the-art performance on real-world data and AIGC data, demonstrating its powerful detail generation capabilities.
△ Less
Submitted 1 July, 2025;
originally announced July 2025.
-
Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration
Authors:
Fanhu Zeng,
Deli Yu,
Zhenglun Kong,
Hao Tang
Abstract:
Vision transformers have been widely explored in various vision tasks. Due to heavy computational cost, much interest has aroused for compressing vision transformer dynamically in the aspect of tokens. Current methods mainly pay attention to token pruning or merging to reduce token numbers, in which tokens are compressed exclusively, causing great information loss and therefore post-training is in…
▽ More
Vision transformers have been widely explored in various vision tasks. Due to heavy computational cost, much interest has aroused for compressing vision transformer dynamically in the aspect of tokens. Current methods mainly pay attention to token pruning or merging to reduce token numbers, in which tokens are compressed exclusively, causing great information loss and therefore post-training is inevitably required to recover the performance. In this paper, we rethink token reduction and unify the process as an explicit form of token matrix transformation, in which all existing methods are constructing special forms of matrices within the framework. Furthermore, we propose a many-to-many Token Transforming framework that serves as a generalization of all existing methods and reserves the most information, even enabling training-free acceleration. We conduct extensive experiments to validate our framework. Specifically, we reduce 40% FLOPs and accelerate DeiT-S by $\times$1.5 with marginal 0.1% accuracy drop. Furthermore, we extend the method to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation. Results demonstrate that the proposed method consistently achieves substantial improvements, offering a better computation-performance trade-off, impressive budget reduction and inference acceleration.
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation
Authors:
Zhenglun Kong,
Zheng Zhan,
Shiyue Hou,
Yifan Gong,
Xin Meng,
Pengwei Sui,
Peiyan Dong,
Xuan Shen,
Zifeng Wang,
Pu Zhao,
Hao Tang,
Stratis Ioannidis,
Yanzhi Wang
Abstract:
Large language models (LLMs) have shown remarkable promise but remain challenging to continually improve through traditional finetuning, particularly when integrating capabilities from other specialized LLMs. Popular methods like ensemble and weight merging require substantial memory and struggle to adapt to changing data environments. Recent efforts have transferred knowledge from multiple LLMs i…
▽ More
Large language models (LLMs) have shown remarkable promise but remain challenging to continually improve through traditional finetuning, particularly when integrating capabilities from other specialized LLMs. Popular methods like ensemble and weight merging require substantial memory and struggle to adapt to changing data environments. Recent efforts have transferred knowledge from multiple LLMs into a single target model; however, they suffer from interference and degraded performance among tasks, largely due to limited flexibility in candidate selection and training pipelines. To address these issues, we propose a framework that adaptively selects and aggregates knowledge from diverse LLMs to build a single, stronger model, avoiding the high memory overhead of ensemble and inflexible weight merging. Specifically, we design an adaptive selection network that identifies the most relevant source LLMs based on their scores, thereby reducing knowledge interference. We further propose a dynamic weighted fusion strategy that accounts for the inherent strengths of candidate LLMs, along with a feedback-driven loss function that prevents the selector from converging on a single subset of sources. Experimental results demonstrate that our method can enable a more stable and scalable knowledge aggregation process while reducing knowledge interference by up to 50% compared to existing approaches. Code is avaliable at https://github.com/ZLKong/LLM_Integration
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Authors:
Zhe Kong,
Feng Gao,
Yong Zhang,
Zhuoliang Kang,
Xiaoming Wei,
Xunliang Cai,
Guanying Chen,
Wenhan Luo
Abstract:
Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and appealing visual quality videos. However, existing methods primarily focus on single human animation and struggle with multi-stream audio inputs, facing incorrect binding problems between audio and persons. Additionally, they exhibit…
▽ More
Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and appealing visual quality videos. However, existing methods primarily focus on single human animation and struggle with multi-stream audio inputs, facing incorrect binding problems between audio and persons. Additionally, they exhibit limitations in instruction-following capabilities. To solve this problem, in this paper, we propose a novel task: Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to address the challenges during multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio and person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk achieves superior performance compared to other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the powerful generation capabilities of our approach.
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
ACE: Exploring Activation Cosine Similarity and Variance for Accurate and Calibration-Efficient LLM Pruning
Authors:
Zhendong Mi,
Zhenglun Kong,
Geng Yuan,
Shaoyi Huang
Abstract:
With the rapid expansion of large language models (LLMs), the demand for memory and computational resources has grown significantly. Recent advances in LLM pruning aim to reduce the size and computational cost of these models. However, existing methods often suffer from either suboptimal pruning performance or low time efficiency during the pruning process. In this work, we propose an efficient an…
▽ More
With the rapid expansion of large language models (LLMs), the demand for memory and computational resources has grown significantly. Recent advances in LLM pruning aim to reduce the size and computational cost of these models. However, existing methods often suffer from either suboptimal pruning performance or low time efficiency during the pruning process. In this work, we propose an efficient and effective pruning method that simultaneously achieves high pruning performance and fast pruning speed with improved calibration efficiency. Our approach introduces two key innovations: (1) An activation cosine similarity loss-guided pruning metric, which considers the angular deviation of the output activation between the dense and pruned models. (2) An activation variance-guided pruning metric, which helps preserve semantic distinctions in output activations after pruning, enabling effective pruning with shorter input sequences. These two components can be readily combined to enhance LLM pruning in both accuracy and efficiency. Experimental results show that our method achieves up to an 18% reduction in perplexity and up to 63% decrease in pruning time on prevalent LLMs such as LLaMA, LLaMA-2, and OPT.
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality
Authors:
Zhenglun Kong,
Yize Li,
Fanhu Zeng,
Lei Xin,
Shvat Messica,
Xue Lin,
Pu Zhao,
Manolis Kellis,
Hao Tang,
Marinka Zitnik
Abstract:
In Transformer architectures, tokens\textemdash discrete units derived from raw data\textemdash are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has pr…
▽ More
In Transformer architectures, tokens\textemdash discrete units derived from raw data\textemdash are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
Structured Agent Distillation for Large Language Model
Authors:
Jun Liu,
Zhenglun Kong,
Peiyan Dong,
Changdi Yang,
Tianqi Li,
Hao Tang,
Geng Yuan,
Wei Niu,
Wenbin Zhang,
Pu Zhao,
Xue Lin,
Dong Huang,
Yanzhi Wang
Abstract:
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reason…
▽ More
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency. Unlike standard token-level distillation, our method segments trajectories into {[REASON]} and {[ACT]} spans, applying segment-specific losses to align each component with the teacher's behavior. This structure-aware supervision enables compact agents to better replicate the teacher's decision process. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines, achieving significant compression with minimal performance drop. Scaling and ablation results further highlight the importance of span-level alignment for efficient and deployable agents.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
Implet: A Post-hoc Subsequence Explainer for Time Series Models
Authors:
Fanyu Meng,
Ziwen Kan,
Shahbaz Rezaei,
Zhaodan Kong,
Xin Chen,
Xin Liu
Abstract:
Explainability in time series models is crucial for fostering trust, facilitating debugging, and ensuring interpretability in real-world applications. In this work, we introduce Implet, a novel post-hoc explainer that generates accurate and concise subsequence-level explanations for time series models. Our approach identifies critical temporal segments that significantly contribute to the model's…
▽ More
Explainability in time series models is crucial for fostering trust, facilitating debugging, and ensuring interpretability in real-world applications. In this work, we introduce Implet, a novel post-hoc explainer that generates accurate and concise subsequence-level explanations for time series models. Our approach identifies critical temporal segments that significantly contribute to the model's predictions, providing enhanced interpretability beyond traditional feature-attribution methods. Based on it, we propose a cohort-based (group-level) explanation framework designed to further improve the conciseness and interpretability of our explanations. We evaluate Implet on several standard time-series classification benchmarks, demonstrating its effectiveness in improving interpretability. The code is available at https://github.com/LbzSteven/implet
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge
Authors:
Chao-Han Huck Yang,
Sreyan Ghosh,
Qing Wang,
Jaeyeon Kim,
Hengyi Hong,
Sonal Kumar,
Guirui Zhong,
Zhifeng Kong,
S Sakshi,
Vaibhavi Lokegaonkar,
Oriol Nieto,
Ramani Duraiswami,
Dinesh Manocha,
Gunhee Kim,
Jun Du,
Rafael Valle,
Bryan Catanzaro
Abstract:
We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes…
▽ More
We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which are crucial for enabling AI agents to perceive and interact about the world effectively.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Revisiting Radar Camera Alignment by Contrastive Learning for 3D Object Detection
Authors:
Linhua Kong,
Dongxia Chang,
Lian Liu,
Zisen Kong,
Pengyuan Li,
Yao Zhao
Abstract:
Recently, 3D object detection algorithms based on radar and camera fusion have shown excellent performance, setting the stage for their application in autonomous driving perception tasks. Existing methods have focused on dealing with feature misalignment caused by the domain gap between radar and camera. However, existing methods either neglect inter-modal features interaction during alignment or…
▽ More
Recently, 3D object detection algorithms based on radar and camera fusion have shown excellent performance, setting the stage for their application in autonomous driving perception tasks. Existing methods have focused on dealing with feature misalignment caused by the domain gap between radar and camera. However, existing methods either neglect inter-modal features interaction during alignment or fail to effectively align features at the same spatial location across modalities. To alleviate the above problems, we propose a new alignment model called Radar Camera Alignment (RCAlign). Specifically, we design a Dual-Route Alignment (DRA) module based on contrastive learning to align and fuse the features between radar and camera. Moreover, considering the sparsity of radar BEV features, a Radar Feature Enhancement (RFE) module is proposed to improve the densification of radar BEV features with the knowledge distillation loss. Experiments show RCAlign achieves a new state-of-the-art on the public nuScenes benchmark in radar camera fusion for 3D Object Detection. Furthermore, the RCAlign achieves a significant performance gain (4.3\% NDS and 8.4\% mAP) in real-time 3D detection compared to the latest state-of-the-art method (RCBEVDet).
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings
Authors:
Zitai Kong,
Yiheng Zhu,
Yinlong Xu,
Hanjing Zhou,
Mingzhe Yin,
Jialu Wu,
Hongxia Xu,
Chang-Yu Hsieh,
Tingjun Hou,
Jian Wu
Abstract:
The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residual semantics and suffer from low inference efficiency, large modeling space and high tr…
▽ More
The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residual semantics and suffer from low inference efficiency, large modeling space and high training cost. To address these challenges, we introduce ProtFlow, a fast flow matching-based protein sequence design framework that operates on embeddings derived from semantically meaningful latent space of protein language models. By compressing and smoothing the latent space, ProtFlow enhances performance while training on limited computational resources. Leveraging reflow techniques, ProtFlow enables high-quality single-step sequence generation. Additionally, we develop a joint design pipeline for the design scene of multichain proteins. We evaluate ProtFlow across diverse protein design tasks, including general peptides and long-chain proteins, antimicrobial peptides, and antibodies. Experimental results demonstrate that ProtFlow outperforms task-specific methods in these applications, underscoring its potential and broad applicability in computational protein sequence design and analysis.
△ Less
Submitted 15 April, 2025;
originally announced April 2025.
-
Efficient Calibration for RRAM-based In-Memory Computing using DoRA
Authors:
Weirong Dong,
Kai Zhou,
Zhen Kong,
Quan Cheng,
Junkai Huang,
Zhengke Yang,
Masanori Hashimoto,
Longyang Lin
Abstract:
Resistive In-Memory Computing (RIMC) offers ultra-efficient computation for edge AI but faces accuracy degradation due to RRAM conductance drift over time. Traditional retraining methods are limited by RRAM's high energy consumption, write latency, and endurance constraints. We propose a DoRA-based calibration framework that restores accuracy by compensating influential weights with minimal calibr…
▽ More
Resistive In-Memory Computing (RIMC) offers ultra-efficient computation for edge AI but faces accuracy degradation due to RRAM conductance drift over time. Traditional retraining methods are limited by RRAM's high energy consumption, write latency, and endurance constraints. We propose a DoRA-based calibration framework that restores accuracy by compensating influential weights with minimal calibration parameters stored in SRAM, leaving RRAM weights untouched. This eliminates in-field RRAM writes, ensuring energy-efficient, fast, and reliable calibration. Experiments on RIMC-based ResNet50 (ImageNet-1K) demonstrate 69.53% accuracy restoration using just 10 calibration samples while updating only 2.34% of parameters.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities
Authors:
Bizhu Wu,
Jinheng Xie,
Keming Shen,
Zhe Kong,
Jianfeng Ren,
Ruibin Bai,
Rong Qu,
Linlin Shen
Abstract:
Recent motion-aware large language models have demonstrated promising potential in unifying motion comprehension and generation. However, existing approaches primarily focus on coarse-grained motion-text modeling, where text describes the overall semantics of an entire motion sequence in just a few words. This limits their ability to handle fine-grained motion-relevant tasks, such as understanding…
▽ More
Recent motion-aware large language models have demonstrated promising potential in unifying motion comprehension and generation. However, existing approaches primarily focus on coarse-grained motion-text modeling, where text describes the overall semantics of an entire motion sequence in just a few words. This limits their ability to handle fine-grained motion-relevant tasks, such as understanding and controlling the movements of specific body parts. To overcome this limitation, we pioneer MG-MotionLLM, a unified motion-language model for multi-granular motion comprehension and generation. We further introduce a comprehensive multi-granularity training scheme by incorporating a set of novel auxiliary tasks, such as localizing temporal boundaries of motion segments via detailed text as well as motion detailed captioning, to facilitate mutual reinforcement for motion-text modeling across various levels of granularity. Extensive experiments show that our MG-MotionLLM achieves superior performance on classical text-to-motion and motion-to-text tasks, and exhibits potential in novel fine-grained motion comprehension and editing tasks. Project page: CVI-SZU/MG-MotionLLM
△ Less
Submitted 3 April, 2025;
originally announced April 2025.
-
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Authors:
Zhenyi Liao,
Qingsong Xie,
Yanhao Zhang,
Zijian Kong,
Haonan Lu,
Zhenyu Yang,
Zhijie Deng
Abstract:
Increasing attention has been placed on improving the reasoning capacities of multi-modal large language models (MLLMs). As the cornerstone for AI agents that function in the physical realm, video-based visual-spatial intelligence (VSI) emerges as one of the most pivotal reasoning capabilities of MLLMs. This work conducts a first, in-depth study on improving the visual-spatial reasoning of MLLMs v…
▽ More
Increasing attention has been placed on improving the reasoning capacities of multi-modal large language models (MLLMs). As the cornerstone for AI agents that function in the physical realm, video-based visual-spatial intelligence (VSI) emerges as one of the most pivotal reasoning capabilities of MLLMs. This work conducts a first, in-depth study on improving the visual-spatial reasoning of MLLMs via R1-Zero-like training. Technically, we first identify that the visual-spatial reasoning capacities of small- to medium-sized Qwen2-VL models cannot be activated via Chain of Thought (CoT) prompts. We then incorporate GRPO training for improved visual-spatial reasoning, using the carefully curated VSI-100k dataset, following DeepSeek-R1-Zero. During the investigation, we identify the necessity to keep the KL penalty (even with a small value) in GRPO. With just 120 GPU hours, our vsGRPO-2B model, fine-tuned from Qwen2-VL-2B, can outperform the base model by 12.1% and surpass GPT-4o. Moreover, our vsGRPO-7B model, fine-tuned from Qwen2-VL-7B, achieves performance comparable to that of the best open-source model LLaVA-NeXT-Video-72B. Additionally, we compare vsGRPO to supervised fine-tuning and direct preference optimization baselines and observe strong performance superiority. The code and dataset will be available soon.
△ Less
Submitted 14 April, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools
Authors:
Shanghua Gao,
Richard Zhu,
Zhenglun Kong,
Ayush Noori,
Xiaorui Su,
Curtis Ginder,
Theodoros Tsiligkaridis,
Marinka Zitnik
Abstract:
Precision therapeutics require multimodal adaptive models that generate personalized treatment recommendations. We introduce TxAgent, an AI agent that leverages multi-step reasoning and real-time biomedical knowledge retrieval across a toolbox of 211 tools to analyze drug interactions, contraindications, and patient-specific treatment strategies. TxAgent evaluates how drugs interact at molecular,…
▽ More
Precision therapeutics require multimodal adaptive models that generate personalized treatment recommendations. We introduce TxAgent, an AI agent that leverages multi-step reasoning and real-time biomedical knowledge retrieval across a toolbox of 211 tools to analyze drug interactions, contraindications, and patient-specific treatment strategies. TxAgent evaluates how drugs interact at molecular, pharmacokinetic, and clinical levels, identifies contraindications based on patient comorbidities and concurrent medications, and tailors treatment strategies to individual patient characteristics. It retrieves and synthesizes evidence from multiple biomedical sources, assesses interactions between drugs and patient conditions, and refines treatment recommendations through iterative reasoning. It selects tools based on task objectives and executes structured function calls to solve therapeutic tasks that require clinical reasoning and cross-source validation. The ToolUniverse consolidates 211 tools from trusted sources, including all US FDA-approved drugs since 1939 and validated clinical insights from Open Targets. TxAgent outperforms leading LLMs, tool-use models, and reasoning agents across five new benchmarks: DrugPC, BrandPC, GenericPC, TreatmentPC, and DescriptionPC, covering 3,168 drug reasoning tasks and 456 personalized treatment scenarios. It achieves 92.1% accuracy in open-ended drug reasoning tasks, surpassing GPT-4o and outperforming DeepSeek-R1 (671B) in structured multi-step reasoning. TxAgent generalizes across drug name variants and descriptions. By integrating multi-step inference, real-time knowledge grounding, and tool-assisted decision-making, TxAgent ensures that treatment recommendations align with established clinical guidelines and real-world evidence, reducing the risk of adverse events and improving therapeutic decision-making.
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
-
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Authors:
Sreyan Ghosh,
Zhifeng Kong,
Sonal Kumar,
S Sakshi,
Jaehyeon Kim,
Wei Ping,
Rafael Valle,
Dinesh Manocha,
Bryan Catanzaro
Abstract:
Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, an…
▽ More
Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across over 20 benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs to 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach. Project Website: https://research.nvidia.com/labs/adlr/AF2/.
△ Less
Submitted 5 March, 2025;
originally announced March 2025.
-
MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue
Authors:
Yujia Chen,
Changsong Li,
Yiming Wang,
Qingqing Xiao,
Nan Zhang,
Zifan Kong,
Peng Wang,
Binyu Yan
Abstract:
Mental health issues are worsening in today's competitive society, such as depression and anxiety. Traditional healings like counseling and chatbots fail to engage effectively, they often provide generic responses lacking emotional depth. Although large language models (LLMs) have the potential to create more human-like interactions, they still struggle to capture subtle emotions. This requires LL…
▽ More
Mental health issues are worsening in today's competitive society, such as depression and anxiety. Traditional healings like counseling and chatbots fail to engage effectively, they often provide generic responses lacking emotional depth. Although large language models (LLMs) have the potential to create more human-like interactions, they still struggle to capture subtle emotions. This requires LLMs to be equipped with human-like adaptability and warmth. To fill this gap, we propose the MIND (Multi-agent INner Dialogue), a novel paradigm that provides more immersive psychological healing environments. Considering the strong generative and role-playing ability of LLM agents, we predefine an interactive healing framework and assign LLM agents different roles within the framework to engage in interactive inner dialogues with users, thereby providing an immersive healing experience. We conduct extensive human experiments in various real-world healing dimensions, and find that MIND provides a more user-friendly experience than traditional paradigms. This demonstrates that MIND effectively leverages the significant potential of LLMs in psychological healing.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization
Authors:
Ziyu Zhang,
Ran Ding,
Ying Zhu,
Ziqian Kong,
Peilan Xu
Abstract:
To enhance tourists' experiences and immersion, this paper proposes a narrative-driven travel planning framework called NarrativeGuide, which generates a geoculturally-grounded narrative script for travelers, offering a novel, role-playing experience for their journey. In the initial stage, NarrativeGuide constructs a knowledge graph for attractions within a city, then configures the worldview, ch…
▽ More
To enhance tourists' experiences and immersion, this paper proposes a narrative-driven travel planning framework called NarrativeGuide, which generates a geoculturally-grounded narrative script for travelers, offering a novel, role-playing experience for their journey. In the initial stage, NarrativeGuide constructs a knowledge graph for attractions within a city, then configures the worldview, character setting, and exposition based on the knowledge graph. Using this foundation, the knowledge graph is combined to generate an independent scene unit for each attraction. During the itinerary planning stage, NarrativeGuide models narrative-driven travel planning as an optimization problem, utilizing a genetic algorithm (GA) to refine the itinerary. Before evaluating the candidate itinerary, transition scripts are generated for each pair of adjacent attractions, which, along with the scene units, form a complete script. The weighted sum of script coherence, travel time, and attraction scores is then used as the fitness value to update the candidate solution set. In our experiments, we incorporated the TravelPlanner benchmark to systematically evaluate the planning capability of NarrativeGuide under complex constraints. In addition, we assessed its performance in terms of narrative coherence and cultural fit. The results show that NarrativeGuide demonstrates strong capabilities in both itinerary planning and script generation.
△ Less
Submitted 8 June, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
Chinese Spelling Correction: A Comprehensive Survey of Progress, Challenges, and Opportunities
Authors:
Changchun Liu,
Kai Zhang,
Junzhe Jiang,
Zixiao Kong,
Qi Liu,
Enhong Chen
Abstract:
Chinese Spelling Correction (CSC) is a critical task in natural language processing, aimed at detecting and correcting spelling errors in Chinese text. This survey provides a comprehensive overview of CSC, tracing its evolution from pre-trained language models to large language models, and critically analyzing their respective strengths and weaknesses in this domain. Moreover, we further present a…
▽ More
Chinese Spelling Correction (CSC) is a critical task in natural language processing, aimed at detecting and correcting spelling errors in Chinese text. This survey provides a comprehensive overview of CSC, tracing its evolution from pre-trained language models to large language models, and critically analyzing their respective strengths and weaknesses in this domain. Moreover, we further present a detailed examination of existing benchmark datasets, highlighting their inherent challenges and limitations. Finally, we propose promising future research directions, particularly focusing on leveraging the potential of LLMs and their reasoning capabilities for improved CSC performance. To the best of our knowledge, this is the first comprehensive survey dedicated to the field of CSC. We believe this work will serve as a valuable resource for researchers, fostering a deeper understanding of the field and inspiring future advancements.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
A2SB: Audio-to-Audio Schrodinger Bridges
Authors:
Zhifeng Kong,
Kevin J Shih,
Weili Nie,
Arash Vahdat,
Sang-gil Lee,
Joao Felipe Santos,
Ante Jukic,
Rafael Valle,
Bryan Catanzaro
Abstract:
Audio in the real world may be perturbed due to numerous factors, causing the audio quality to be degraded. The following work presents an audio restoration model tailored for high-res music at 44.1kHz. Our model, Audio-to-Audio Schrodinger Bridges (A2SB), is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB…
▽ More
Audio in the real world may be perturbed due to numerous factors, causing the audio quality to be degraded. The following work presents an audio restoration model tailored for high-res music at 44.1kHz. Our model, Audio-to-Audio Schrodinger Bridges (A2SB), is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB is end-to-end without need of a vocoder to predict waveform outputs, able to restore hour-long audio inputs, and trained on permissively licensed music data. A2SB is capable of achieving state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets. Our demo website is https: //research.nvidia.com/labs/adlr/A2SB/.
△ Less
Submitted 20 January, 2025;
originally announced January 2025.
-
Smart Contract Fuzzing Towards Profitable Vulnerabilities
Authors:
Ziqiao Kong,
Cen Zhang,
Maoyi Xie,
Ming Hu,
Yue Xue,
Ye Liu,
Haijun Wang,
Yang Liu
Abstract:
Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: a lack of profit-centric techniques for expediting detection, and insufficient…
▽ More
Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: a lack of profit-centric techniques for expediting detection, and insufficient automation in maximizing the profitability of discovered vulnerabilities, leaving the analysis to human experts. To address these gaps, we have developed VERITE, a profit-centric smart contract fuzzing framework that not only effectively detects those profitable vulnerabilities but also maximizes the exploited profits.
VERITE has three key features: 1) DeFi action-based mutators for boosting the exploration of transactions with different fund flows; 2) potentially profitable candidates identification criteria, which checks whether the input has caused abnormal fund flow properties during testing; 3) a gradient descent-based profit maximization strategy for these identified candidates.
VERITE is fully developed from scratch and evaluated on a dataset consisting of 61 exploited real-world DeFi projects with an average of over 1.1 million dollars loss. The results show that VERITE can automatically extract more than 18 million dollars in total and is significantly better than state-of-the-art fuzzer ITYFUZZ in both detection (29/10) and exploitation (134 times more profits gained on average). Remarkably, in 12 targets, it gains more profits than real-world attacking exploits (1.01 to 11.45 times more). VERITE is also applied by auditors in contract auditing, where 6 (5 high severity) zero-day vulnerabilities are found with over $2,500 bounty rewards.
△ Less
Submitted 12 February, 2025; v1 submitted 15 January, 2025;
originally announced January 2025.
-
RoRA: Efficient Fine-Tuning of LLM with Reliability Optimization for Rank Adaptation
Authors:
Jun Liu,
Zhenglun Kong,
Peiyan Dong,
Changdi Yang,
Xuan Shen,
Pu Zhao,
Hao Tang,
Geng Yuan,
Wei Niu,
Wenbin Zhang,
Xue Lin,
Dong Huang,
Yanzhi Wang
Abstract:
Fine-tuning helps large language models (LLM) recover degraded information and enhance task performance. Although Low-Rank Adaptation (LoRA) is widely used and effective for fine-tuning, we have observed that its scaling factor can limit or even reduce performance as the rank size increases. To address this issue, we propose RoRA (Rank-adaptive Reliability Optimization), a simple yet effective met…
▽ More
Fine-tuning helps large language models (LLM) recover degraded information and enhance task performance. Although Low-Rank Adaptation (LoRA) is widely used and effective for fine-tuning, we have observed that its scaling factor can limit or even reduce performance as the rank size increases. To address this issue, we propose RoRA (Rank-adaptive Reliability Optimization), a simple yet effective method for optimizing LoRA's scaling factor. By replacing $α/r$ with $α/\sqrt{r}$, RoRA ensures improved performance as rank size increases. Moreover, RoRA enhances low-rank adaptation in fine-tuning uncompressed models and excels in the more challenging task of accuracy recovery when fine-tuning pruned models. Extensive experiments demonstrate the effectiveness of RoRA in fine-tuning both uncompressed and pruned models. RoRA surpasses the state-of-the-art (SOTA) in average accuracy and robustness on LLaMA-7B/13B, LLaMA2-7B, and LLaMA3-8B, specifically outperforming LoRA and DoRA by 6.5% and 2.9% on LLaMA-7B, respectively. In pruned model fine-tuning, RoRA shows significant advantages; for SHEARED-LLAMA-1.3, a LLaMA-7B with 81.4% pruning, RoRA achieves 5.7% higher average accuracy than LoRA and 3.9% higher than DoRA.
△ Less
Submitted 11 January, 2025; v1 submitted 8 January, 2025;
originally announced January 2025.
-
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
Authors:
Chia-Yu Hung,
Navonil Majumder,
Zhifeng Kong,
Ambuj Mehrish,
Amir Ali Bagherzadeh,
Chuan Li,
Rafael Valle,
Bryan Catanzaro,
Soujanya Poria
Abstract:
We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Mo…
▽ More
We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.
△ Less
Submitted 10 April, 2025; v1 submitted 30 December, 2024;
originally announced December 2024.
-
ETTA: Elucidating the Design Space of Text-to-Audio Models
Authors:
Sang-gil Lee,
Zhifeng Kong,
Arushi Goel,
Sungwon Kim,
Rafael Valle,
Bryan Catanzaro
Abstract:
Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic under…
▽ More
Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions -- a task that is more challenging than current benchmarks.
△ Less
Submitted 30 June, 2025; v1 submitted 26 December, 2024;
originally announced December 2024.
-
7B Fully Open Source Moxin-LLM/VLM -- From Pretraining to GRPO-based Reinforcement Learning Enhancement
Authors:
Pu Zhao,
Xuan Shen,
Zhenglun Kong,
Yixin Shen,
Sung-En Chang,
Arash Akbari,
Timothy Rupprecht,
Lei Lu,
Enfu Nan,
Changdi Yang,
Yumei He,
Weiyan Shi,
Xingchen Xu,
Yu Huang,
Wei Jiang,
Wei Wang,
Yue Chen,
Yong He,
Yanzhi Wang
Abstract:
Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA, have made great cont…
▽ More
Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA, have made great contributions to the ever-increasing popularity of LLMs due to the ease to customize and deploy the models across diverse applications. Although open-source LLMs present unprecedented opportunities for innovation and research, the commercialization of LLMs has raised concerns about transparency, reproducibility, and safety. Many open-source LLMs fail to meet fundamental transparency requirements by withholding essential components like training code and data, which may hinder further innovations on LLMs. To mitigate this issue, we introduce Moxin 7B, a fully open-source LLM developed, adhering to principles of open science, open source, open data, and open access. We release the pre-training code and configurations, training and fine-tuning datasets, and intermediate and final checkpoints, aiming to make continuous commitments to fully open-source LLMs. After pre-training the base model, we finetune the Moxin Base model with SOTA post-training framework and instruction data to obtain Moxin Instruct model. To improve the reasoning capability, we further finetune our Instruct model with chain-of-thought data distilled from DeepSeek R1, and then use Group Relative Policy Optimization (GRPO) following DeepSeek R1 to finetune our model, leading to the Moxin Reasoning model. Moreover, we develop our vision language model based on our Moxin model. Experiments show that our models achieve superior performance in various evaluations such as zero-shot evaluation, few-shot evaluation, and CoT evaluation.
△ Less
Submitted 3 July, 2025; v1 submitted 7 December, 2024;
originally announced December 2024.
-
S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning
Authors:
Mingze Yin,
Hanjing Zhou,
Jialu Wu,
Yiheng Zhu,
Yuxuan Zhan,
Zitai Kong,
Hongxia Xu,
Chang-Yu Hsieh,
Jintai Chen,
Tingjun Hou,
Jian Wu
Abstract:
Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown the great potential to interpret complex biological structures and functions. However, existing antibody specific models have a notable limi…
▽ More
Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown the great potential to interpret complex biological structures and functions. However, existing antibody specific models have a notable limitation that they lack explicit consideration for antibody structural information, despite the fact that both 1D sequence and 3D structure carry unique and complementary insights into antibody behavior and functionality. This paper proposes Sequence-Structure multi-level pre-trained Antibody Language Model (S$^2$ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm incorporated with two customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S$^2$ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties and structural interaction patterns. Pre-trained over 75 million sequences and 11.7 million structures, S$^2$ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying antibody crucial binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S$^2$ALM outperforms well-established and renowned baselines and sets new state-of-the-art performance across extensive antibody specific understanding and generation tasks. S$^2$ALM's ability to model comprehensive and generalized representations further positions its potential to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.
△ Less
Submitted 20 November, 2024;
originally announced November 2024.
-
High-Fidelity Cellular Network Control-Plane Traffic Generation without Domain Knowledge
Authors:
Z. Jonny Kong,
Nathan Hu,
Y. Charlie Hu,
Jiayi Meng,
Yaron Koral
Abstract:
With rapid evolution of mobile core network (MCN) architectures, large-scale control-plane traffic (CPT) traces are critical to studying MCN design and performance optimization by the R&D community. The prior-art control-plane traffic generator SMM heavily relies on domain knowledge which requires re-design as the domain evolves. In this work, we study the feasibility of developing a high-fidelity…
▽ More
With rapid evolution of mobile core network (MCN) architectures, large-scale control-plane traffic (CPT) traces are critical to studying MCN design and performance optimization by the R&D community. The prior-art control-plane traffic generator SMM heavily relies on domain knowledge which requires re-design as the domain evolves. In this work, we study the feasibility of developing a high-fidelity MCN control plane traffic generator by leveraging generative ML models. We identify key challenges in synthesizing high-fidelity CPT including generic (to data-plane) requirements such as multimodality feature relationships and unique requirements such as stateful semantics and long-term (time-of-day) data variations. We show state-of-the-art, generative adversarial network (GAN)-based approaches shown to work well for data-plane traffic cannot meet these fidelity requirements of CPT, and develop a transformer-based model, CPT-GPT, that accurately captures complex dependencies among the samples in each traffic stream (control events by the same UE) without the need for GAN. Our evaluation of CPT-GPT on a large-scale control-plane traffic trace shows that (1) it does not rely on domain knowledge yet synthesizes control-plane traffic with comparable fidelity as SMM; (2) compared to the prior-art GAN-based approach, it reduces the fraction of streams that violate stateful semantics by two orders of magnitude, the max y-distance of sojourn time distributions of streams by 16.0%, and the transfer learning time in deriving new hourly models by 3.36x.
△ Less
Submitted 11 November, 2024;
originally announced November 2024.
-
Fast and Memory-Efficient Video Diffusion Using Streamlined Inference
Authors:
Zheng Zhan,
Yushu Wu,
Yifan Gong,
Zichong Meng,
Zhenglun Kong,
Changdi Yang,
Geng Yuan,
Pu Zhao,
Wei Niu,
Yanzhi Wang
Abstract:
The rapid progress in artificial intelligence-generated content (AIGC), especially with diffusion models, has significantly advanced development of high-quality video generation. However, current video diffusion models exhibit demanding computational requirements and high peak memory usage, especially for generating longer and higher-resolution videos. These limitations greatly hinder the practica…
▽ More
The rapid progress in artificial intelligence-generated content (AIGC), especially with diffusion models, has significantly advanced development of high-quality video generation. However, current video diffusion models exhibit demanding computational requirements and high peak memory usage, especially for generating longer and higher-resolution videos. These limitations greatly hinder the practical application of video diffusion models on standard hardware platforms. To tackle this issue, we present a novel, training-free framework named Streamlined Inference, which leverages the temporal and spatial properties of video diffusion models. Our approach integrates three core components: Feature Slicer, Operator Grouping, and Step Rehash. Specifically, Feature Slicer effectively partitions input features into sub-features and Operator Grouping processes each sub-feature with a group of consecutive operators, resulting in significant memory reduction without sacrificing the quality or speed. Step Rehash further exploits the similarity between adjacent steps in diffusion, and accelerates inference through skipping unnecessary steps. Extensive experiments demonstrate that our approach significantly reduces peak memory and computational overhead, making it feasible to generate high-quality videos on a single consumer GPU (e.g., reducing peak memory of AnimateDiff from 42GB to 11GB, featuring faster inference on 2080Ti).
△ Less
Submitted 2 November, 2024;
originally announced November 2024.
-
A Multi-Granularity Supervised Contrastive Framework for Remaining Useful Life Prediction of Aero-engines
Authors:
Zixuan He,
Ziqian Kong,
Zhengyu Chen,
Yuling Zhan,
Zijun Que,
Zhengguo Xu
Abstract:
Accurate remaining useful life (RUL) predictions are critical to the safe operation of aero-engines. Currently, the RUL prediction task is mainly a regression paradigm with only mean square error as the loss function and lacks research on feature space structure, the latter of which has shown excellent performance in a large number of studies. This paper develops a multi-granularity supervised con…
▽ More
Accurate remaining useful life (RUL) predictions are critical to the safe operation of aero-engines. Currently, the RUL prediction task is mainly a regression paradigm with only mean square error as the loss function and lacks research on feature space structure, the latter of which has shown excellent performance in a large number of studies. This paper develops a multi-granularity supervised contrastive (MGSC) framework from plain intuition that samples with the same RUL label should be aligned in the feature space, and address the problems of too large minibatch size and unbalanced samples in the implementation. The RUL prediction with MGSC is implemented on using the proposed multi-phase training strategy. This paper also demonstrates a simple and scalable basic network structure and validates the proposed MGSC strategy on the CMPASS dataset using a convolutional long short-term memory network as a baseline, which effectively improves the accuracy of RUL prediction.
△ Less
Submitted 14 November, 2024; v1 submitted 1 November, 2024;
originally announced November 2024.
-
Energy-Optimal Planning of Waypoint-Based UAV Missions -- Does Minimum Distance Mean Minimum Energy?
Authors:
Nicolas Michel,
Ayush Patnaik,
Zhaodan Kong,
Xinfan Lin
Abstract:
Multirotor unmanned aerial vehicle is a prevailing type of aerial robots with wide real-world applications. The energy efficiency of the robot is a critical aspect of its performance, determining the range and duration of the missions that can be performed. This paper studies the energy-optimal planning of the multirotor, which aims at finding the optimal ordering of waypoints with the minimum ene…
▽ More
Multirotor unmanned aerial vehicle is a prevailing type of aerial robots with wide real-world applications. The energy efficiency of the robot is a critical aspect of its performance, determining the range and duration of the missions that can be performed. This paper studies the energy-optimal planning of the multirotor, which aims at finding the optimal ordering of waypoints with the minimum energy consumption for missions in 3D space. The study is performed based on a previously developed model capturing first-principle energy dynamics of the multirotor. We found that in majority of the cases (up to 95%) the solutions of the energy-optimal planning are different from those of the traditional traveling salesman problem which minimizes the total distance. The difference can be as high as 14.9%, with the average at 1.6%-3.3% and 90th percentile at 3.7%-6.5% depending on the range and number of waypoints in the mission. We then identified and explained the key features of the minimum-energy order by correlating to the underlying flight energy dynamics. It is shown that instead of minimizing the distance, coordination of vertical and horizontal motion to promote aerodynamic efficiency is the key to optimizing energy consumption.
△ Less
Submitted 23 October, 2024;
originally announced October 2024.
-
Pruning Foundation Models for High Accuracy without Retraining
Authors:
Pu Zhao,
Fei Sun,
Xuan Shen,
Pinrui Yu,
Zhenglun Kong,
Yanzhi Wang,
Xue Lin
Abstract:
Despite the superior performance, it is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. While pruning is a promising technique to reduce model size and accelerate the inference, the traditional pruning techniques can hardly be applied for LLMs as they need to finetune the model on the full dataset with multiple epochs consum…
▽ More
Despite the superior performance, it is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. While pruning is a promising technique to reduce model size and accelerate the inference, the traditional pruning techniques can hardly be applied for LLMs as they need to finetune the model on the full dataset with multiple epochs consuming massive data and hardware resources. To deal with this problem, post-training pruning methods are proposed to prune LLMs in one-shot without retraining. However, their accuracy after pruning may suffer from certain performance degradation due to the lack of retraining with massive data. To address this issue, in this paper, we first formulate the post-training problem for layer-wise LLM compression to simultaneously prune multiple weights in LLMs. Next, we provide an optimal solution for this problem and design our post-training pruning algorithm for both unstructured and semi-structured sparsity. Our extensive experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines across various LLM families including transformer-based LLMs and Mamba-based LLMs. Code link: https://github.com/piuzha/APT
△ Less
Submitted 20 October, 2024;
originally announced October 2024.
-
Rethinking Token Reduction for State Space Models
Authors:
Zheng Zhan,
Yushu Wu,
Zhenglun Kong,
Changdi Yang,
Yifan Gong,
Xuan Shen,
Xue Lin,
Pu Zhao,
Yanzhi Wang
Abstract:
Recent advancements in State Space Models (SSMs) have attracted significant interest, particularly in models optimized for parallel training and handling long-range dependencies. Architectures like Mamba have scaled to billions of parameters with selective SSM. To facilitate broader applications using Mamba, exploring its efficiency is crucial. While token reduction techniques offer a straightforw…
▽ More
Recent advancements in State Space Models (SSMs) have attracted significant interest, particularly in models optimized for parallel training and handling long-range dependencies. Architectures like Mamba have scaled to billions of parameters with selective SSM. To facilitate broader applications using Mamba, exploring its efficiency is crucial. While token reduction techniques offer a straightforward post-training strategy, we find that applying existing methods directly to SSMs leads to substantial performance drops. Through insightful analysis, we identify the reasons for this failure and the limitations of current techniques. In response, we propose a tailored, unified post-training token reduction method for SSMs. Our approach integrates token importance and similarity, thus taking advantage of both pruning and merging, to devise a fine-grained intra-layer token reduction strategy. Extensive experiments show that our method improves the average accuracy by 5.7% to 13.1% on six benchmarks with Mamba-2 compared to existing methods, while significantly reducing computational demands and memory requirements.
△ Less
Submitted 15 October, 2024;
originally announced October 2024.
-
Interpreting Inflammation Prediction Model via Tag-based Cohort Explanation
Authors:
Fanyu Meng,
Jules Larke,
Xin Liu,
Zhaodan Kong,
Xin Chen,
Danielle Lemay,
Ilias Tagkopoulos
Abstract:
Machine learning is revolutionizing nutrition science by enabling systems to learn from data and make intelligent decisions. However, the complexity of these models often leads to challenges in understanding their decision-making processes, necessitating the development of explainability techniques to foster trust and increase model transparency. An under-explored type of explanation is cohort exp…
▽ More
Machine learning is revolutionizing nutrition science by enabling systems to learn from data and make intelligent decisions. However, the complexity of these models often leads to challenges in understanding their decision-making processes, necessitating the development of explainability techniques to foster trust and increase model transparency. An under-explored type of explanation is cohort explanation, which provides explanations to groups of instances with similar characteristics. Unlike traditional methods that focus on individual explanations or global model behavior, cohort explainability bridges the gap by providing unique insights at an intermediate granularity. We propose a novel framework for identifying cohorts within a dataset based on local feature importance scores, aiming to generate concise descriptions of the clusters via tags. We evaluate our framework on a food-based inflammation prediction model and demonstrated that the framework can generate reliable explanations that match domain knowledge.
△ Less
Submitted 17 October, 2024;
originally announced October 2024.
-
CohEx: A Generalized Framework for Cohort Explanation
Authors:
Fanyu Meng,
Xin Liu,
Zhaodan Kong,
Xin Chen
Abstract:
eXplainable Artificial Intelligence (XAI) has garnered significant attention for enhancing transparency and trust in machine learning models. However, the scopes of most existing explanation techniques focus either on offering a holistic view of the explainee model (global explanation) or on individual instances (local explanation), while the middle ground, i.e., cohort-based explanation, is less…
▽ More
eXplainable Artificial Intelligence (XAI) has garnered significant attention for enhancing transparency and trust in machine learning models. However, the scopes of most existing explanation techniques focus either on offering a holistic view of the explainee model (global explanation) or on individual instances (local explanation), while the middle ground, i.e., cohort-based explanation, is less explored. Cohort explanations offer insights into the explainee's behavior on a specific group or cohort of instances, enabling a deeper understanding of model decisions within a defined context. In this paper, we discuss the unique challenges and opportunities associated with measuring cohort explanations, define their desired properties, and create a generalized framework for generating cohort explanations based on supervised clustering.
△ Less
Submitted 11 December, 2024; v1 submitted 16 October, 2024;
originally announced October 2024.
-
Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
Authors:
Sreyan Ghosh,
Sonal Kumar,
Zhifeng Kong,
Rafael Valle,
Bryan Catanzaro,
Dinesh Manocha
Abstract:
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-wo…
▽ More
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-world audios. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging because not only should the generated data be acoustically consistent with the underlying small-scale dataset, but they should also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.
△ Less
Submitted 11 March, 2025; v1 submitted 2 October, 2024;
originally announced October 2024.
-
Exploring Token Pruning in Vision State Space Models
Authors:
Zheng Zhan,
Zhenglun Kong,
Yifan Gong,
Yushu Wu,
Zichong Meng,
Hangyu Zheng,
Xuan Shen,
Stratis Ioannidis,
Wei Niu,
Pu Zhao,
Yanzhi Wang
Abstract:
State Space Models (SSMs) have the advantage of keeping linear computational complexity compared to attention modules in transformers, and have been applied to vision tasks as a new type of powerful vision foundation model. Inspired by the observations that the final prediction in vision transformers (ViTs) is only based on a subset of most informative tokens, we take the novel step of enhancing t…
▽ More
State Space Models (SSMs) have the advantage of keeping linear computational complexity compared to attention modules in transformers, and have been applied to vision tasks as a new type of powerful vision foundation model. Inspired by the observations that the final prediction in vision transformers (ViTs) is only based on a subset of most informative tokens, we take the novel step of enhancing the efficiency of SSM-based vision models through token-based pruning. However, direct applications of existing token pruning techniques designed for ViTs fail to deliver good performance, even with extensive fine-tuning. To address this issue, we revisit the unique computational characteristics of SSMs and discover that naive application disrupts the sequential token positions. This insight motivates us to design a novel and general token pruning method specifically for SSM-based vision models. We first introduce a pruning-aware hidden state alignment method to stabilize the neighborhood of remaining tokens for performance enhancement. Besides, based on our detailed analysis, we propose a token importance evaluation method adapted for SSM models, to guide the token pruning. With efficient implementation and practical acceleration methods, our method brings actual speedup. Extensive experiments demonstrate that our approach can achieve significant computation reduction with minimal impact on performance across different tasks. Notably, we achieve 81.7\% accuracy on ImageNet with a 41.6\% reduction in the FLOPs for pruned PlainMamba-L3. Furthermore, our work provides deeper insights into understanding the behavior of SSM-based vision models for future research.
△ Less
Submitted 27 September, 2024;
originally announced September 2024.
-
Search for Efficient Large Language Models
Authors:
Xuan Shen,
Pu Zhao,
Yifan Gong,
Zhenglun Kong,
Zheng Zhan,
Yushu Wu,
Ming Lin,
Chao Wu,
Xue Lin,
Yanzhi Wang
Abstract:
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Numerous efficient techniques, including weight pruning, quantization, and distillation, have been embraced to compress LLMs, targeting memory reduction and inference acceleration, which underscore the redundancy in LLMs. However, most model compression techniques concentrate on weight optimization,…
▽ More
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Numerous efficient techniques, including weight pruning, quantization, and distillation, have been embraced to compress LLMs, targeting memory reduction and inference acceleration, which underscore the redundancy in LLMs. However, most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures. Besides, traditional architecture search methods, limited by the elevated complexity with extensive parameters, struggle to demonstrate their effectiveness on LLMs. In this paper, we propose a training-free architecture search framework to identify optimal subnets that preserve the fundamental strengths of the original LLMs while achieving inference acceleration. Furthermore, after generating subnets that inherit specific weights from the original LLMs, we introduce a reformation algorithm that utilizes the omitted weights to rectify the inherited weights with a small amount of calibration data. Compared with SOTA training-free structured pruning works that can generate smaller networks, our method demonstrates superior performance across standard benchmarks. Furthermore, our generated subnets can directly reduce the usage of GPU memory and achieve inference acceleration. Code: https://github.com/shawnricecake/search-llm
△ Less
Submitted 30 October, 2024; v1 submitted 25 September, 2024;
originally announced September 2024.
-
StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos
Authors:
Sijie Zhao,
Wenbo Hu,
Xiaodong Cun,
Yong Zhang,
Xiaoyu Li,
Zhe Kong,
Xiangjun Gao,
Muyao Niu,
Ying Shan
Abstract:
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experience. Leveraging foundation models as priors, our approach overcomes the limitations of traditional methods and boosts the performance to ensure the high-fidelity generation required by the display devices. The proposed system consists of two…
▽ More
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experience. Leveraging foundation models as priors, our approach overcomes the limitations of traditional methods and boosts the performance to ensure the high-fidelity generation required by the display devices. The proposed system consists of two main steps: depth-based video splatting for warping and extracting occlusion mask, and stereo video inpainting. We utilize pre-trained stable video diffusion as the backbone and introduce a fine-tuning protocol for the stereo video inpainting task. To handle input video with varying lengths and resolutions, we explore auto-regressive strategies and tiled processing. Finally, a sophisticated data processing pipeline has been developed to reconstruct a large-scale and high-quality dataset to support our training. Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices like Apple Vision Pro and 3D displays. In summary, this work contributes to the field by presenting an effective method for generating high-quality stereoscopic videos from monocular input, potentially transforming how we experience digital media.
△ Less
Submitted 11 September, 2024;
originally announced September 2024.
-
GRATR: Zero-Shot Evidence Graph Retrieval-Augmented Trustworthiness Reasoning
Authors:
Ying Zhu,
Shengchang Li,
Ziqian Kong,
Qiang Yang,
Peilan Xu
Abstract:
Trustworthiness reasoning aims to enable agents in multiplayer games with incomplete information to identify potential allies and adversaries, thereby enhancing decision-making. In this paper, we introduce the graph retrieval-augmented trustworthiness reasoning (GRATR) framework, which retrieves observable evidence from the game environment to inform decision-making by large language models (LLMs)…
▽ More
Trustworthiness reasoning aims to enable agents in multiplayer games with incomplete information to identify potential allies and adversaries, thereby enhancing decision-making. In this paper, we introduce the graph retrieval-augmented trustworthiness reasoning (GRATR) framework, which retrieves observable evidence from the game environment to inform decision-making by large language models (LLMs) without requiring additional training, making it a zero-shot approach. Within the GRATR framework, agents first observe the actions of other players and evaluate the resulting shifts in inter-player trust, constructing a corresponding trustworthiness graph. During decision-making, the agent performs multi-hop retrieval to evaluate trustworthiness toward a specific target, where evidence chains are retrieved from multiple trusted sources to form a comprehensive assessment. Experiments in the multiplayer game \emph{Werewolf} demonstrate that GRATR outperforms the alternatives, improving reasoning accuracy by 50.5\% and reducing hallucination by 30.6\% compared to the baseline method. Additionally, when tested on a dataset of Twitter tweets during the U.S. election period, GRATR surpasses the baseline method by 10.4\% in accuracy, highlighting its potential in real-world applications such as intent analysis.
△ Less
Submitted 27 January, 2025; v1 submitted 22 August, 2024;
originally announced August 2024.
-
Image Denoising Using Green Channel Prior
Authors:
Zhaoming Kong,
Fangxi Deng,
Xiaowei Yang
Abstract:
Image denoising is an appealing and challenging task, in that noise statistics of real-world observations may vary with local image contents and different image channels. Specifically, the green channel usually has twice the sampling rate in raw data. To handle noise variances and leverage such channel-wise prior information, we propose a simple and effective green channel prior-based image denois…
▽ More
Image denoising is an appealing and challenging task, in that noise statistics of real-world observations may vary with local image contents and different image channels. Specifically, the green channel usually has twice the sampling rate in raw data. To handle noise variances and leverage such channel-wise prior information, we propose a simple and effective green channel prior-based image denoising (GCP-ID) method, which integrates GCP into the classic patch-based denoising framework. Briefly, we exploit the green channel to guide the search for similar patches, which aims to improve the patch grouping quality and encourage sparsity in the transform domain. The grouped image patches are then reformulated into RGGB arrays to explicitly characterize the density of green samples. Furthermore, to enhance the adaptivity of GCP-ID to various image contents, we cast the noise estimation problem into a classification task and train an effective estimator based on convolutional neural networks (CNNs). Experiments on real-world datasets demonstrate the competitive performance of the proposed GCP-ID method for image and video denoising applications in both raw and sRGB spaces. Our code is available at https://github.com/ZhaomingKong/GCP-ID.
△ Less
Submitted 12 August, 2024;
originally announced August 2024.
-
Anytime Trust Rating Dynamics in a Human-Robot Interaction Task
Authors:
Jason Dekarske,
Gregory Bales,
Zhaodan Kong,
Sanjay Joshi
Abstract:
Objective We model factors contributing to rating timing for a single-dimensional, any-time trust in robotics measure.
Background Many studies view trust as a slow-changing value after subjects complete a trial or at regular intervals. Trust is a multifaceted concept that can be measured simultaneously with a human-robot interaction.
Method 65 subjects commanded a remote robot arm in a simulat…
▽ More
Objective We model factors contributing to rating timing for a single-dimensional, any-time trust in robotics measure.
Background Many studies view trust as a slow-changing value after subjects complete a trial or at regular intervals. Trust is a multifaceted concept that can be measured simultaneously with a human-robot interaction.
Method 65 subjects commanded a remote robot arm in a simulated space station. The robot picked and placed stowage commanded by the subject, but the robot's performance varied from trial to trial. Subjects rated their trust on a non-obtrusive trust slider at any time throughout the experiment.
Results A Cox Proportional Hazards Model described the time it took subjects to rate their trust in the robot. A retrospective survey indicated that subjects based their trust on the robot's performance or outcome of the task. Strong covariates representing the task's state reflected this in the model.
Conclusion Trust and robot task performance contributed little to the timing of the trust rating. The subjects' exit survey responses aligned with the assumption that the robot's task progress was the main reason for the timing of their trust rating.
Application Measuring trust in a human-robot interaction task should take as little attention away from the task as possible. This trust rating technique lays the groundwork for single-dimensional trust queries that probe estimated human action.
△ Less
Submitted 31 July, 2024;
originally announced August 2024.
-
MambaCapsule: Towards Transparent Cardiac Disease Diagnosis with Electrocardiography Using Mamba Capsule Network
Authors:
Yinlong Xu,
Xiaoqiang Liu,
Zitai Kong,
Yixuan Wu,
Yue Wang,
Yingzhou Lu,
Honghao Gao,
Jian Wu,
Hongxia Xu
Abstract:
Cardiac arrhythmia, a condition characterized by irregular heartbeats, often serves as an early indication of various heart ailments. With the advent of deep learning, numerous innovative models have been introduced for diagnosing arrhythmias using Electrocardiogram (ECG) signals. However, recent studies solely focus on the performance of models, neglecting the interpretation of their results. Thi…
▽ More
Cardiac arrhythmia, a condition characterized by irregular heartbeats, often serves as an early indication of various heart ailments. With the advent of deep learning, numerous innovative models have been introduced for diagnosing arrhythmias using Electrocardiogram (ECG) signals. However, recent studies solely focus on the performance of models, neglecting the interpretation of their results. This leads to a considerable lack of transparency, posing a significant risk in the actual diagnostic process. To solve this problem, this paper introduces MambaCapsule, a deep neural networks for ECG arrhythmias classification, which increases the explainability of the model while enhancing the accuracy.Our model utilizes Mamba for feature extraction and Capsule networks for prediction, providing not only a confidence score but also signal features. Akin to the processing mechanism of human brain, the model learns signal features and their relationship between them by reconstructing ECG signals in the predicted selection. The model evaluation was conducted on MIT-BIH and PTB dataset, following the AAMI standard. MambaCapsule has achieved a total accuracy of 99.54% and 99.59% on the test sets respectively. These results demonstrate the promising performance of under the standard test protocol.
△ Less
Submitted 30 July, 2024;
originally announced July 2024.
-
Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers
Authors:
Zhengang Li,
Alec Lu,
Yanyue Xie,
Zhenglun Kong,
Mengshu Sun,
Hao Tang,
Zhong Jia Xue,
Peiyan Dong,
Caiwen Ding,
Yanzhi Wang,
Xue Lin,
Zhenman Fang
Abstract:
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs). However, ViT models are often computation-intensive for efficient deployment on resource-limited edge devices. This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs, to design efficient ViT models for…
▽ More
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs). However, ViT models are often computation-intensive for efficient deployment on resource-limited edge devices. This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs, to design efficient ViT models for hardware implementation while preserving the accuracy. First, Quasar-ViT trains a supernet using our row-wise flexible mixed-precision quantization scheme, mixed-precision weight entanglement, and supernet layer scaling techniques. Then, it applies an efficient hardware-oriented search algorithm, integrated with hardware latency and resource modeling, to determine a series of optimal subnets from supernet under different inference latency targets. Finally, we propose a series of model-adaptive designs on the FPGA platform to support the architecture search and mitigate the gap between the theoretical computation reduction and the practical inference speedup. Our searched models achieve 101.5, 159.6, and 251.6 frames-per-second (FPS) inference speed on the AMD/Xilinx ZCU102 FPGA with 80.4%, 78.6%, and 74.9% top-1 accuracy, respectively, for the ImageNet dataset, consistently outperforming prior works.
△ Less
Submitted 25 July, 2024;
originally announced July 2024.
-
A Geometry-Aware Algorithm to Learn Hierarchical Embeddings in Hyperbolic Space
Authors:
Zhangyu Wang,
Lantian Xu,
Zhifeng Kong,
Weilong Wang,
Xuyu Peng,
Enyang Zheng
Abstract:
Hyperbolic embeddings are a class of representation learning methods that offer competitive performances when data can be abstracted as a tree-like graph. However, in practice, learning hyperbolic embeddings of hierarchical data is difficult due to the different geometry between hyperbolic space and the Euclidean space. To address such difficulties, we first categorize three kinds of illness that…
▽ More
Hyperbolic embeddings are a class of representation learning methods that offer competitive performances when data can be abstracted as a tree-like graph. However, in practice, learning hyperbolic embeddings of hierarchical data is difficult due to the different geometry between hyperbolic space and the Euclidean space. To address such difficulties, we first categorize three kinds of illness that harm the performance of the embeddings. Then, we develop a geometry-aware algorithm using a dilation operation and a transitive closure regularization to tackle these illnesses. We empirically validate these techniques and present a theoretical analysis of the mechanism behind the dilation operation. Experiments on synthetic and real-world datasets reveal superior performances of our algorithm.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
LayoutCopilot: An LLM-powered Multi-agent Collaborative Framework for Interactive Analog Layout Design
Authors:
Bingyang Liu,
Haoyi Zhang,
Xiaohan Gao,
Zichen Kong,
Xiyuan Tang,
Yibo Lin,
Runsheng Wang,
Ru Huang
Abstract:
Analog layout design heavily involves interactive processes between humans and design tools. Electronic Design Automation (EDA) tools for this task are usually designed to use scripting commands or visualized buttons for manipulation, especially for interactive automation functionalities, which have a steep learning curve and cumbersome user experience, making a notable barrier to designers' adopt…
▽ More
Analog layout design heavily involves interactive processes between humans and design tools. Electronic Design Automation (EDA) tools for this task are usually designed to use scripting commands or visualized buttons for manipulation, especially for interactive automation functionalities, which have a steep learning curve and cumbersome user experience, making a notable barrier to designers' adoption. Aiming to address such a usability issue, this paper introduces LayoutCopilot, a pioneering multi-agent collaborative framework powered by Large Language Models (LLMs) for interactive analog layout design. LayoutCopilot simplifies human-tool interaction by converting natural language instructions into executable script commands, and it interprets high-level design intents into actionable suggestions, significantly streamlining the design process. Experimental results demonstrate the flexibility, efficiency, and accessibility of LayoutCopilot in handling real-world analog designs.
△ Less
Submitted 13 January, 2025; v1 submitted 26 June, 2024;
originally announced June 2024.
-
Improving Text-To-Audio Models with Synthetic Captions
Authors:
Zhifeng Kong,
Sang-gil Lee,
Deepanway Ghosal,
Navonil Majumder,
Ambuj Mehrish,
Rafael Valle,
Soujanya Poria,
Bryan Catanzaro
Abstract:
It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged \textit{text-only language models} to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an \textit{audio language model}…
▽ More
It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged \textit{text-only language models} to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an \textit{audio language model} to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named \texttt{AF-AudioSet}, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find leveraging our pipeline and synthetic captions leads to significant improvements on audio generation quality, achieving a new \textit{state-of-the-art}.
△ Less
Submitted 8 July, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
A Reliable Framework for Human-in-the-Loop Anomaly Detection in Time Series
Authors:
Ziquan Deng,
Xiwei Xuan,
Kwan-Liu Ma,
Zhaodan Kong
Abstract:
Time series anomaly detection is a critical machine learning task for numerous applications, such as finance, healthcare, and industrial systems. However, even high-performing models may exhibit potential issues such as biases, leading to unreliable outcomes and misplaced confidence. While model explanation techniques, particularly visual explanations, offer valuable insights by elucidating model…
▽ More
Time series anomaly detection is a critical machine learning task for numerous applications, such as finance, healthcare, and industrial systems. However, even high-performing models may exhibit potential issues such as biases, leading to unreliable outcomes and misplaced confidence. While model explanation techniques, particularly visual explanations, offer valuable insights by elucidating model attributions of their decision, many limitations still exist -- They are primarily instance-based and not scalable across the dataset, and they provide one-directional information from the model to the human side, lacking a mechanism for users to address detected issues. To fulfill these gaps, we introduce HILAD, a novel framework designed to foster a dynamic and bidirectional collaboration between humans and AI for enhancing anomaly detection models in time series. Through our visual interface, HILAD empowers domain experts to detect, interpret, and correct unexpected model behaviors at scale. Our evaluation through user studies with two models and three time series datasets demonstrates the effectiveness of HILAD, which fosters a deeper model understanding, immediate corrective actions, and model reliability enhancement.
△ Less
Submitted 23 June, 2025; v1 submitted 6 May, 2024;
originally announced May 2024.
-
Dynamic Human Trust Modeling of Autonomous Agents With Varying Capability and Strategy
Authors:
Jason Dekarske,
Zhaodan Kong,
Sanjay Joshi
Abstract:
Objective We model the dynamic trust of human subjects in a human-autonomy-teaming screen-based task.
Background Trust is an emerging area of study in human-robot collaboration. Many studies have looked at the issue of robot performance as a sole predictor of human trust, but this could underestimate the complexity of the interaction.
Method Subjects were paired with autonomous agents to searc…
▽ More
Objective We model the dynamic trust of human subjects in a human-autonomy-teaming screen-based task.
Background Trust is an emerging area of study in human-robot collaboration. Many studies have looked at the issue of robot performance as a sole predictor of human trust, but this could underestimate the complexity of the interaction.
Method Subjects were paired with autonomous agents to search an on-screen grid to determine the number of outlier objects. In each trial, a different autonomous agent with a preassigned capability used one of three search strategies and then reported the number of outliers it found as a fraction of its capability. Then, the subject reported their total outlier estimate. Human subjects then evaluated statements about the agent's behavior, reliability, and their trust in the agent.
Results 80 subjects were recruited. Self-reported trust was modeled using Ordinary Least Squares, but the group that interacted with varying capability agents on a short time order produced a better performing ARIMAX model. Models were cross-validated between groups and found a moderate improvement in the next trial trust prediction.
Conclusion A time series modeling approach reveals the effects of temporal ordering of agent performance on estimated trust. Recency bias may affect how subjects weigh the contribution of strategy or capability to trust. Understanding the connections between agent behavior, agent performance, and human trust is crucial to improving human-robot collaborative tasks.
Application The modeling approach in this study demonstrates the need to represent autonomous agent characteristics over time to capture changes in human trust.
△ Less
Submitted 30 April, 2024;
originally announced April 2024.
-
Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras
Authors:
Jun Yu,
Yutong Dai,
Xiaokang Liu,
Jin Huang,
Yishan Shen,
Ke Zhang,
Rong Zhou,
Eashan Adhikarla,
Wenxuan Ye,
Yixin Liu,
Zhaoming Kong,
Kai Zhang,
Yilong Yin,
Vinod Namboodiri,
Brian D. Davison,
Jason H. Moore,
Yong Chen
Abstract:
MTL is a learning paradigm that effectively leverages both task-specific and shared information to address multiple related tasks simultaneously. In contrast to STL, MTL offers a suite of benefits that enhance both the training process and the inference efficiency. MTL's key advantages encompass streamlined model architecture, performance enhancement, and cross-domain generalizability. Over the pa…
▽ More
MTL is a learning paradigm that effectively leverages both task-specific and shared information to address multiple related tasks simultaneously. In contrast to STL, MTL offers a suite of benefits that enhance both the training process and the inference efficiency. MTL's key advantages encompass streamlined model architecture, performance enhancement, and cross-domain generalizability. Over the past twenty years, MTL has become widely recognized as a flexible and effective approach in various fields, including CV, NLP, recommendation systems, disease prognosis and diagnosis, and robotics. This survey provides a comprehensive overview of the evolution of MTL, encompassing the technical aspects of cutting-edge methods from traditional approaches to deep learning and the latest trend of pretrained foundation models. Our survey methodically categorizes MTL techniques into five key areas: regularization, relationship learning, feature propagation, optimization, and pre-training. This categorization not only chronologically outlines the development of MTL but also dives into various specialized strategies within each category. Furthermore, the survey reveals how the MTL evolves from handling a fixed set of tasks to embracing a more flexible approach free from task or modality constraints. It explores the concepts of task-promptable and -agnostic training, along with the capacity for ZSL, which unleashes the untapped potential of this historically coveted learning paradigm. Overall, we hope this survey provides the research community with a comprehensive overview of the advancements in MTL from its inception in 1997 to the present in 2023. We address present challenges and look ahead to future possibilities, shedding light on the opportunities and potential avenues for MTL research in a broad manner. This project is publicly available at https://github.com/junfish/Awesome-Multitask-Learning.
△ Less
Submitted 29 April, 2024;
originally announced April 2024.