Showing 1–50 of 136 results for author: Kong, Z

Searching in archive cs.
  1. arXiv:2507.04704  [pdf, ps, other]

    q-bio.QM cs.AI cs.CV

    SPATIA: Multimodal Model for Prediction and Generation of Spatial Cell Phenotypes

    Authors: Zhenglun Kong, Mufan Qiu, John Boesen, Xiang Lin, Sukwon Yun, Tianlong Chen, Manolis Kellis, Marinka Zitnik

    Abstract: Understanding how cellular morphology, gene expression, and spatial organization jointly shape tissue function is a central challenge in biology. Image-based spatial transcriptomics technologies now provide high-resolution measurements of cell images and gene expression profiles, but machine learning methods typically analyze these modalities in isolation or at limited resolution. We address the p…

    Submitted 7 July, 2025; originally announced July 2025.

  2. arXiv:2507.01012  [pdf, ps, other]

    cs.CV

    DAM-VSR: Disentanglement of Appearance and Motion for Video Super-Resolution

    Authors: Zhe Kong, Le Li, Yong Zhang, Feng Gao, Shaoshu Yang, Tao Wang, Kaihao Zhang, Zhuoliang Kang, Xiaoming Wei, Guanying Chen, Wenhan Luo

    Abstract: Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. H…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted by ACM SIGGRAPH 2025, Homepage: https://kongzhecn.github.io/projects/dam-vsr/ Github: https://github.com/kongzhecn/DAM-VSR

  3. arXiv:2506.05709  [pdf, ps, other]

    cs.CV

    Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

    Authors: Fanhu Zeng, Deli Yu, Zhenglun Kong, Hao Tang

    Abstract: Vision transformers have been widely explored in various vision tasks. Due to heavy computational cost, much interest has arisen in dynamically compressing vision transformers at the token level. Current methods mainly pay attention to token pruning or merging to reduce token numbers, in which tokens are compressed exclusively, causing great information loss and therefore post-training is in…

    Submitted 5 June, 2025; originally announced June 2025.

  4. arXiv:2505.23844  [pdf, ps, other]

    cs.CL

    Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation

    Authors: Zhenglun Kong, Zheng Zhan, Shiyue Hou, Yifan Gong, Xin Meng, Pengwei Sui, Peiyan Dong, Xuan Shen, Zifeng Wang, Pu Zhao, Hao Tang, Stratis Ioannidis, Yanzhi Wang

    Abstract: Large language models (LLMs) have shown remarkable promise but remain challenging to continually improve through traditional finetuning, particularly when integrating capabilities from other specialized LLMs. Popular methods like ensemble and weight merging require substantial memory and struggle to adapt to changing data environments. Recent efforts have transferred knowledge from multiple LLMs i…

    Submitted 28 May, 2025; originally announced May 2025.

  5. arXiv:2505.22647  [pdf, ps, other]

    cs.CV

    Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

    Authors: Zhe Kong, Feng Gao, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Xunliang Cai, Guanying Chen, Wenhan Luo

    Abstract: Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and appealing visual quality videos. However, existing methods primarily focus on single human animation and struggle with multi-stream audio inputs, facing incorrect binding problems between audio and persons. Additionally, they exhibit…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Homepage: https://meigen-ai.github.io/multi-talk Github: https://github.com/MeiGen-AI/MultiTalk

  6. arXiv:2505.21987  [pdf, other]

    cs.LG

    ACE: Exploring Activation Cosine Similarity and Variance for Accurate and Calibration-Efficient LLM Pruning

    Authors: Zhendong Mi, Zhenglun Kong, Geng Yuan, Shaoyi Huang

    Abstract: With the rapid expansion of large language models (LLMs), the demand for memory and computational resources has grown significantly. Recent advances in LLM pruning aim to reduce the size and computational cost of these models. However, existing methods often suffer from either suboptimal pruning performance or low time efficiency during the pruning process. In this work, we propose an efficient an…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: 9 pages, 2 figures, 13 tables

    ACM Class: I.2.6; I.2.7

  7. arXiv:2505.18227  [pdf, ps, other]

    cs.LG cs.AI

    Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

    Authors: Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik

    Abstract: In Transformer architectures, tokens, discrete units derived from raw data, are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has pr…

    Submitted 23 May, 2025; originally announced May 2025.

  8. arXiv:2505.13820  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Structured Agent Distillation for Large Language Model

    Authors: Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

    Abstract: Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reason…

    Submitted 19 May, 2025; originally announced May 2025.

  9. arXiv:2505.08748  [pdf, ps, other]

    cs.LG

    Implet: A Post-hoc Subsequence Explainer for Time Series Models

    Authors: Fanyu Meng, Ziwen Kan, Shahbaz Rezaei, Zhaodan Kong, Xin Chen, Xin Liu

    Abstract: Explainability in time series models is crucial for fostering trust, facilitating debugging, and ensuring interpretability in real-world applications. In this work, we introduce Implet, a novel post-hoc explainer that generates accurate and concise subsequence-level explanations for time series models. Our approach identifies critical temporal segments that significantly contribute to the model's…

    Submitted 13 May, 2025; originally announced May 2025.

  10. arXiv:2505.07365  [pdf, ps, other]

    cs.SD cs.AI cs.CL cs.MM eess.AS

    Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge

    Authors: Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan Catanzaro

    Abstract: We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: Preprint. DCASE 2025 Audio QA Challenge: https://dcase.community/challenge2025/task-audio-question-answering

  11. arXiv:2504.16368  [pdf, other]

    cs.CV

    Revisiting Radar Camera Alignment by Contrastive Learning for 3D Object Detection

    Authors: Linhua Kong, Dongxia Chang, Lian Liu, Zisen Kong, Pengyuan Li, Yao Zhao

    Abstract: Recently, 3D object detection algorithms based on radar and camera fusion have shown excellent performance, setting the stage for their application in autonomous driving perception tasks. Existing methods have focused on dealing with feature misalignment caused by the domain gap between radar and camera. However, existing methods either neglect inter-modal features interaction during alignment or…

    Submitted 22 April, 2025; originally announced April 2025.

  12. arXiv:2504.10983  [pdf, other]

    cs.LG cs.AI q-bio.BM

    ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings

    Authors: Zitai Kong, Yiheng Zhu, Yinlong Xu, Hanjing Zhou, Mingzhe Yin, Jialu Wu, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou, Jian Wu

    Abstract: The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residual semantics and suffer from low inference efficiency, large modeling space and high tr…

    Submitted 15 April, 2025; originally announced April 2025.

  13. arXiv:2504.03763  [pdf, other]

    cs.AR cs.AI cs.LG

    Efficient Calibration for RRAM-based In-Memory Computing using DoRA

    Authors: Weirong Dong, Kai Zhou, Zhen Kong, Quan Cheng, Junkai Huang, Zhengke Yang, Masanori Hashimoto, Longyang Lin

    Abstract: Resistive In-Memory Computing (RIMC) offers ultra-efficient computation for edge AI but faces accuracy degradation due to RRAM conductance drift over time. Traditional retraining methods are limited by RRAM's high energy consumption, write latency, and endurance constraints. We propose a DoRA-based calibration framework that restores accuracy by compensating influential weights with minimal calibr…

    Submitted 2 April, 2025; originally announced April 2025.

    Comments: 7 pages, 6 figures

  14. arXiv:2504.02478  [pdf, other]

    cs.CV

    MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities

    Authors: Bizhu Wu, Jinheng Xie, Keming Shen, Zhe Kong, Jianfeng Ren, Ruibin Bai, Rong Qu, Linlin Shen

    Abstract: Recent motion-aware large language models have demonstrated promising potential in unifying motion comprehension and generation. However, existing approaches primarily focus on coarse-grained motion-text modeling, where text describes the overall semantics of an entire motion sequence in just a few words. This limits their ability to handle fine-grained motion-relevant tasks, such as understanding…

    Submitted 3 April, 2025; originally announced April 2025.

  15. arXiv:2504.00883  [pdf, other]

    cs.CV cs.AI

    Improved Visual-Spatial Reasoning via R1-Zero-Like Training

    Authors: Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, Zhijie Deng

    Abstract: Increasing attention has been placed on improving the reasoning capacities of multi-modal large language models (MLLMs). As the cornerstone for AI agents that function in the physical realm, video-based visual-spatial intelligence (VSI) emerges as one of the most pivotal reasoning capabilities of MLLMs. This work conducts a first, in-depth study on improving the visual-spatial reasoning of MLLMs v…

    Submitted 14 April, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

  16. arXiv:2503.10970  [pdf, other]

    cs.AI cs.LG

    TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools

    Authors: Shanghua Gao, Richard Zhu, Zhenglun Kong, Ayush Noori, Xiaorui Su, Curtis Ginder, Theodoros Tsiligkaridis, Marinka Zitnik

    Abstract: Precision therapeutics require multimodal adaptive models that generate personalized treatment recommendations. We introduce TxAgent, an AI agent that leverages multi-step reasoning and real-time biomedical knowledge retrieval across a toolbox of 211 tools to analyze drug interactions, contraindications, and patient-specific treatment strategies. TxAgent evaluates how drugs interact at molecular,…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: Project page: https://zitniklab.hms.harvard.edu/TxAgent TxAgent code: https://github.com/mims-harvard/TxAgent ToolUniverse code: https://github.com/mims-harvard/ToolUniverse

  17. arXiv:2503.03983  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

    Authors: Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro

    Abstract: Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, an…

    Submitted 5 March, 2025; originally announced March 2025.

  18. arXiv:2502.19860  [pdf, other]

    cs.CL cs.AI

    MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue

    Authors: Yujia Chen, Changsong Li, Yiming Wang, Qingqing Xiao, Nan Zhang, Zifan Kong, Peng Wang, Binyu Yan

    Abstract: Mental health issues, such as depression and anxiety, are worsening in today's competitive society. Traditional healing approaches like counseling and chatbots fail to engage effectively; they often provide generic responses lacking emotional depth. Although large language models (LLMs) have the potential to create more human-like interactions, they still struggle to capture subtle emotions. This requires LL…

    Submitted 27 February, 2025; originally announced February 2025.

  19. arXiv:2502.14456  [pdf, ps, other]

    cs.AI

    Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization

    Authors: Ziyu Zhang, Ran Ding, Ying Zhu, Ziqian Kong, Peilan Xu

    Abstract: To enhance tourists' experiences and immersion, this paper proposes a narrative-driven travel planning framework called NarrativeGuide, which generates a geoculturally-grounded narrative script for travelers, offering a novel, role-playing experience for their journey. In the initial stage, NarrativeGuide constructs a knowledge graph for attractions within a city, then configures the worldview, ch…

    Submitted 8 June, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

  20. arXiv:2502.11508  [pdf, other]

    cs.CL cs.AI

    Chinese Spelling Correction: A Comprehensive Survey of Progress, Challenges, and Opportunities

    Authors: Changchun Liu, Kai Zhang, Junzhe Jiang, Zixiao Kong, Qi Liu, Enhong Chen

    Abstract: Chinese Spelling Correction (CSC) is a critical task in natural language processing, aimed at detecting and correcting spelling errors in Chinese text. This survey provides a comprehensive overview of CSC, tracing its evolution from pre-trained language models to large language models, and critically analyzing their respective strengths and weaknesses in this domain. Moreover, we further present a…

    Submitted 17 February, 2025; originally announced February 2025.

  21. arXiv:2501.11311  [pdf, other]

    cs.SD cs.LG eess.AS

    A2SB: Audio-to-Audio Schrodinger Bridges

    Authors: Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro

    Abstract: Audio in the real world may be perturbed due to numerous factors, causing the audio quality to be degraded. The following work presents an audio restoration model tailored for high-res music at 44.1kHz. Our model, Audio-to-Audio Schrodinger Bridges (A2SB), is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB…

    Submitted 20 January, 2025; originally announced January 2025.

  22. arXiv:2501.08834  [pdf, other]

    cs.CR cs.SE

    Smart Contract Fuzzing Towards Profitable Vulnerabilities

    Authors: Ziqiao Kong, Cen Zhang, Maoyi Xie, Ming Hu, Yue Xue, Ye Liu, Haijun Wang, Yang Liu

    Abstract: Billions of dollars are transacted through smart contracts, making vulnerabilities a major financial risk. One focus in the security arms race is on profitable vulnerabilities that attackers can exploit. Fuzzing is a key method for identifying these vulnerabilities. However, current solutions face two main limitations: a lack of profit-centric techniques for expediting detection, and insufficient…

    Submitted 12 February, 2025; v1 submitted 15 January, 2025; originally announced January 2025.

    Comments: Camera-ready version

    Journal ref: FSE 2025

  23. arXiv:2501.04315  [pdf, other]

    cs.LG cs.AI

    RoRA: Efficient Fine-Tuning of LLM with Reliability Optimization for Rank Adaptation

    Authors: Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Xuan Shen, Pu Zhao, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Dong Huang, Yanzhi Wang

    Abstract: Fine-tuning helps large language models (LLM) recover degraded information and enhance task performance. Although Low-Rank Adaptation (LoRA) is widely used and effective for fine-tuning, we have observed that its scaling factor can limit or even reduce performance as the rank size increases. To address this issue, we propose RoRA (Rank-adaptive Reliability Optimization), a simple yet effective met…

    Submitted 11 January, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

    Comments: 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  24. arXiv:2412.21037  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

    Authors: Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Amir Ali Bagherzadeh, Chuan Li, Rafael Valle, Bryan Catanzaro, Soujanya Poria

    Abstract: We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Mo…

    Submitted 10 April, 2025; v1 submitted 30 December, 2024; originally announced December 2024.

    Comments: https://tangoflux.github.io/

  25. arXiv:2412.19351  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    ETTA: Elucidating the Design Space of Text-to-Audio Models

    Authors: Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro

    Abstract: Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic under…

    Submitted 30 June, 2025; v1 submitted 26 December, 2024; originally announced December 2024.

    Comments: ICML 2025. Demo: https://research.nvidia.com/labs/adlr/ETTA/ Code: https://github.com/NVIDIA/elucidated-text-to-audio

  26. arXiv:2412.06845  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    7B Fully Open Source Moxin-LLM/VLM -- From Pretraining to GRPO-based Reinforcement Learning Enhancement

    Authors: Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang, Arash Akbari, Timothy Rupprecht, Lei Lu, Enfu Nan, Changdi Yang, Yumei He, Weiyan Shi, Xingchen Xu, Yu Huang, Wei Jiang, Wei Wang, Yue Chen, Yong He, Yanzhi Wang

    Abstract: Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA, have made great cont…

    Submitted 3 July, 2025; v1 submitted 7 December, 2024; originally announced December 2024.

  27. arXiv:2411.15215  [pdf, other]

    cs.LG cs.AI q-bio.BM

    S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning

    Authors: Mingze Yin, Hanjing Zhou, Jialu Wu, Yiheng Zhu, Yuxuan Zhan, Zitai Kong, Hongxia Xu, Chang-Yu Hsieh, Jintai Chen, Tingjun Hou, Jian Wu

    Abstract: Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown the great potential to interpret complex biological structures and functions. However, existing antibody specific models have a notable limi…

    Submitted 20 November, 2024; originally announced November 2024.

  28. High-Fidelity Cellular Network Control-Plane Traffic Generation without Domain Knowledge

    Authors: Z. Jonny Kong, Nathan Hu, Y. Charlie Hu, Jiayi Meng, Yaron Koral

    Abstract: With rapid evolution of mobile core network (MCN) architectures, large-scale control-plane traffic (CPT) traces are critical to studying MCN design and performance optimization by the R&D community. The prior-art control-plane traffic generator SMM heavily relies on domain knowledge which requires re-design as the domain evolves. In this work, we study the feasibility of developing a high-fidelity…

    Submitted 11 November, 2024; originally announced November 2024.

  29. arXiv:2411.01171  [pdf, other]

    cs.CV cs.AI

    Fast and Memory-Efficient Video Diffusion Using Streamlined Inference

    Authors: Zheng Zhan, Yushu Wu, Yifan Gong, Zichong Meng, Zhenglun Kong, Changdi Yang, Geng Yuan, Pu Zhao, Wei Niu, Yanzhi Wang

    Abstract: The rapid progress in artificial intelligence-generated content (AIGC), especially with diffusion models, has significantly advanced development of high-quality video generation. However, current video diffusion models exhibit demanding computational requirements and high peak memory usage, especially for generating longer and higher-resolution videos. These limitations greatly hinder the practica…

    Submitted 2 November, 2024; originally announced November 2024.

    Comments: Accepted to NeurIPS 2024

  30. arXiv:2411.00461  [pdf, other]

    cs.LG cs.AI eess.SY

    A Multi-Granularity Supervised Contrastive Framework for Remaining Useful Life Prediction of Aero-engines

    Authors: Zixuan He, Ziqian Kong, Zhengyu Chen, Yuling Zhan, Zijun Que, Zhengguo Xu

    Abstract: Accurate remaining useful life (RUL) predictions are critical to the safe operation of aero-engines. Currently, the RUL prediction task is mainly a regression paradigm with only mean square error as the loss function and lacks research on feature space structure, the latter of which has shown excellent performance in a large number of studies. This paper develops a multi-granularity supervised con…

    Submitted 14 November, 2024; v1 submitted 1 November, 2024; originally announced November 2024.

  31. arXiv:2410.17585  [pdf, other]

    cs.RO

    Energy-Optimal Planning of Waypoint-Based UAV Missions -- Does Minimum Distance Mean Minimum Energy?

    Authors: Nicolas Michel, Ayush Patnaik, Zhaodan Kong, Xinfan Lin

    Abstract: Multirotor unmanned aerial vehicles are a prevailing type of aerial robot with wide real-world applications. The energy efficiency of the robot is a critical aspect of its performance, determining the range and duration of the missions that can be performed. This paper studies the energy-optimal planning of the multirotor, which aims at finding the optimal ordering of waypoints with the minimum ene…

    Submitted 23 October, 2024; originally announced October 2024.

    Comments: This paper has been accepted for presentation at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024

  32. arXiv:2410.15567  [pdf, other]

    cs.LG cs.AI cs.CL

    Pruning Foundation Models for High Accuracy without Retraining

    Authors: Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin

    Abstract: Despite the superior performance, it is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. While pruning is a promising technique to reduce model size and accelerate the inference, the traditional pruning techniques can hardly be applied for LLMs as they need to finetune the model on the full dataset with multiple epochs consum…

    Submitted 20 October, 2024; originally announced October 2024.

    Comments: Accepted by EMNLP 2024 findings

  33. arXiv:2410.14725  [pdf, other]

    cs.LG cs.CL

    Rethinking Token Reduction for State Space Models

    Authors: Zheng Zhan, Yushu Wu, Zhenglun Kong, Changdi Yang, Yifan Gong, Xuan Shen, Xue Lin, Pu Zhao, Yanzhi Wang

    Abstract: Recent advancements in State Space Models (SSMs) have attracted significant interest, particularly in models optimized for parallel training and handling long-range dependencies. Architectures like Mamba have scaled to billions of parameters with selective SSM. To facilitate broader applications using Mamba, exploring its efficiency is crucial. While token reduction techniques offer a straightforw…

    Submitted 15 October, 2024; originally announced October 2024.

    Comments: EMNLP 2024

  34. arXiv:2410.14082  [pdf, other]

    cs.LG cs.AI

    Interpreting Inflammation Prediction Model via Tag-based Cohort Explanation

    Authors: Fanyu Meng, Jules Larke, Xin Liu, Zhaodan Kong, Xin Chen, Danielle Lemay, Ilias Tagkopoulos

    Abstract: Machine learning is revolutionizing nutrition science by enabling systems to learn from data and make intelligent decisions. However, the complexity of these models often leads to challenges in understanding their decision-making processes, necessitating the development of explainability techniques to foster trust and increase model transparency. An under-explored type of explanation is cohort exp…

    Submitted 17 October, 2024; originally announced October 2024.

  35. arXiv:2410.13190  [pdf, other]

    cs.LG cs.AI

    CohEx: A Generalized Framework for Cohort Explanation

    Authors: Fanyu Meng, Xin Liu, Zhaodan Kong, Xin Chen

    Abstract: eXplainable Artificial Intelligence (XAI) has garnered significant attention for enhancing transparency and trust in machine learning models. However, the scopes of most existing explanation techniques focus either on offering a holistic view of the explainee model (global explanation) or on individual instances (local explanation), while the middle ground, i.e., cohort-based explanation, is less…

    Submitted 11 December, 2024; v1 submitted 16 October, 2024; originally announced October 2024.

  36. arXiv:2410.02056  [pdf, other]

    eess.AS cs.AI cs.CL

    Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

    Authors: Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha

    Abstract: We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-wo…

    Submitted 11 March, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

    Comments: Accepted at ICLR 2025. Code and Checkpoints available here: https://github.com/Sreyan88/Synthio

  37. arXiv:2409.18962  [pdf, other]

    cs.CV cs.AI cs.LG

    Exploring Token Pruning in Vision State Space Models

    Authors: Zheng Zhan, Zhenglun Kong, Yifan Gong, Yushu Wu, Zichong Meng, Hangyu Zheng, Xuan Shen, Stratis Ioannidis, Wei Niu, Pu Zhao, Yanzhi Wang

    Abstract: State Space Models (SSMs) have the advantage of keeping linear computational complexity compared to attention modules in transformers, and have been applied to vision tasks as a new type of powerful vision foundation model. Inspired by the observations that the final prediction in vision transformers (ViTs) is only based on a subset of most informative tokens, we take the novel step of enhancing t…

    Submitted 27 September, 2024; originally announced September 2024.

    Comments: NeurIPS'24

  38. arXiv:2409.17372  [pdf, ps, other]

    cs.AI

    Search for Efficient Large Language Models

    Authors: Xuan Shen, Pu Zhao, Yifan Gong, Zhenglun Kong, Zheng Zhan, Yushu Wu, Ming Lin, Chao Wu, Xue Lin, Yanzhi Wang

    Abstract: Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Numerous efficient techniques, including weight pruning, quantization, and distillation, have been embraced to compress LLMs, targeting memory reduction and inference acceleration, which underscore the redundancy in LLMs. However, most model compression techniques concentrate on weight optimization,…

    Submitted 30 October, 2024; v1 submitted 25 September, 2024; originally announced September 2024.

    Comments: Accepted by NeurIPS 2024

  39. arXiv:2409.07447  [pdf, other]

    cs.CV cs.GR

    StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos

    Authors: Sijie Zhao, Wenbo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, Ying Shan

    Abstract: This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experience. Leveraging foundation models as priors, our approach overcomes the limitations of traditional methods and boosts the performance to ensure the high-fidelity generation required by the display devices. The proposed system consists of two…

    Submitted 11 September, 2024; originally announced September 2024.

    Comments: 11 pages, 10 figures

    ACM Class: I.3.0; I.4.0

  40. arXiv:2408.12333  [pdf, other]

    cs.AI

    GRATR: Zero-Shot Evidence Graph Retrieval-Augmented Trustworthiness Reasoning

    Authors: Ying Zhu, Shengchang Li, Ziqian Kong, Qiang Yang, Peilan Xu

    Abstract: Trustworthiness reasoning aims to enable agents in multiplayer games with incomplete information to identify potential allies and adversaries, thereby enhancing decision-making. In this paper, we introduce the graph retrieval-augmented trustworthiness reasoning (GRATR) framework, which retrieves observable evidence from the game environment to inform decision-making by large language models (LLMs)…

    Submitted 27 January, 2025; v1 submitted 22 August, 2024; originally announced August 2024.

  41. arXiv:2408.05923  [pdf, other]

    eess.IV cs.CV

    Image Denoising Using Green Channel Prior

    Authors: Zhaoming Kong, Fangxi Deng, Xiaowei Yang

    Abstract: Image denoising is an appealing and challenging task, in that noise statistics of real-world observations may vary with local image contents and different image channels. Specifically, the green channel usually has twice the sampling rate in raw data. To handle noise variances and leverage such channel-wise prior information, we propose a simple and effective green channel prior-based image denois…

    Submitted 12 August, 2024; originally announced August 2024.

    Comments: arXiv admin note: text overlap with arXiv:2402.08235

  42. arXiv:2408.00238  [pdf, other]

    cs.HC

    Anytime Trust Rating Dynamics in a Human-Robot Interaction Task

    Authors: Jason Dekarske, Gregory Bales, Zhaodan Kong, Sanjay Joshi

    Abstract: Objective: We model factors contributing to rating timing for a single-dimensional, any-time trust in robotics measure. Background: Many studies view trust as a slow-changing value after subjects complete a trial or at regular intervals. Trust is a multifaceted concept that can be measured simultaneously with a human-robot interaction. Method: 65 subjects commanded a remote robot arm in a simulat…

    Submitted 31 July, 2024; originally announced August 2024.

  43. arXiv:2407.20893  [pdf, other]

    cs.LG cs.AI eess.SP

    MambaCapsule: Towards Transparent Cardiac Disease Diagnosis with Electrocardiography Using Mamba Capsule Network

    Authors: Yinlong Xu, Xiaoqiang Liu, Zitai Kong, Yixuan Wu, Yue Wang, Yingzhou Lu, Honghao Gao, Jian Wu, Hongxia Xu

    Abstract: Cardiac arrhythmia, a condition characterized by irregular heartbeats, often serves as an early indication of various heart ailments. With the advent of deep learning, numerous innovative models have been introduced for diagnosing arrhythmias using Electrocardiogram (ECG) signals. However, recent studies solely focus on the performance of models, neglecting the interpretation of their results. Thi…

    Submitted 30 July, 2024; originally announced July 2024.

  44. arXiv:2407.18175  [pdf, other]

    cs.LG cs.AI cs.CV

    Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers

    Authors: Zhengang Li, Alec Lu, Yanyue Xie, Zhenglun Kong, Mengshu Sun, Hao Tang, Zhong Jia Xue, Peiyan Dong, Caiwen Ding, Yanzhi Wang, Xue Lin, Zhenman Fang

    Abstract: Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs). However, ViT models are often computation-intensive for efficient deployment on resource-limited edge devices. This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs, to design efficient ViT models for…

    Submitted 25 July, 2024; originally announced July 2024.

    Comments: Accepted by ICS 2024

  45. arXiv:2407.16641  [pdf, other]

    cs.LG cs.AI

    A Geometry-Aware Algorithm to Learn Hierarchical Embeddings in Hyperbolic Space

    Authors: Zhangyu Wang, Lantian Xu, Zhifeng Kong, Weilong Wang, Xuyu Peng, Enyang Zheng

    Abstract: Hyperbolic embeddings are a class of representation learning methods that offer competitive performances when data can be abstracted as a tree-like graph. However, in practice, learning hyperbolic embeddings of hierarchical data is difficult due to the different geometry between hyperbolic space and the Euclidean space. To address such difficulties, we first categorize three kinds of illness that…

    Submitted 23 July, 2024; originally announced July 2024.

  46. arXiv:2406.18873  [pdf, other]

    cs.AR

    LayoutCopilot: An LLM-powered Multi-agent Collaborative Framework for Interactive Analog Layout Design

    Authors: Bingyang Liu, Haoyi Zhang, Xiaohan Gao, Zichen Kong, Xiyuan Tang, Yibo Lin, Runsheng Wang, Ru Huang

    Abstract: Analog layout design heavily involves interactive processes between humans and design tools. Electronic Design Automation (EDA) tools for this task are usually designed to use scripting commands or visualized buttons for manipulation, especially for interactive automation functionalities, which have a steep learning curve and cumbersome user experience, making a notable barrier to designers' adopt…

    Submitted 13 January, 2025; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: 8 pages, 8 figures

  47. arXiv:2406.15487  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Improving Text-To-Audio Models with Synthetic Captions

    Authors: Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro

    Abstract: It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model…

    Submitted 8 July, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  48. arXiv:2405.03234  [pdf, ps, other]

    cs.HC cs.LG

    A Reliable Framework for Human-in-the-Loop Anomaly Detection in Time Series

    Authors: Ziquan Deng, Xiwei Xuan, Kwan-Liu Ma, Zhaodan Kong

    Abstract: Time series anomaly detection is a critical machine learning task for numerous applications, such as finance, healthcare, and industrial systems. However, even high-performing models may exhibit potential issues such as biases, leading to unreliable outcomes and misplaced confidence. While model explanation techniques, particularly visual explanations, offer valuable insights by elucidating model…

    Submitted 23 June, 2025; v1 submitted 6 May, 2024; originally announced May 2024.

    Comments: The manuscript is currently under review

  49. arXiv:2404.19291  [pdf, other]

    cs.HC

    Dynamic Human Trust Modeling of Autonomous Agents With Varying Capability and Strategy

    Authors: Jason Dekarske, Zhaodan Kong, Sanjay Joshi

    Abstract: Objective: We model the dynamic trust of human subjects in a human-autonomy-teaming screen-based task. Background: Trust is an emerging area of study in human-robot collaboration. Many studies have looked at the issue of robot performance as a sole predictor of human trust, but this could underestimate the complexity of the interaction. Method: Subjects were paired with autonomous agents to searc…

    Submitted 30 April, 2024; originally announced April 2024.

  50. arXiv:2404.18961  [pdf, other]

    cs.LG cs.AI cs.CV

    Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras

    Authors: Jun Yu, Yutong Dai, Xiaokang Liu, Jin Huang, Yishan Shen, Ke Zhang, Rong Zhou, Eashan Adhikarla, Wenxuan Ye, Yixin Liu, Zhaoming Kong, Kai Zhang, Yilong Yin, Vinod Namboodiri, Brian D. Davison, Jason H. Moore, Yong Chen

    Abstract: MTL is a learning paradigm that effectively leverages both task-specific and shared information to address multiple related tasks simultaneously. In contrast to STL, MTL offers a suite of benefits that enhance both the training process and the inference efficiency. MTL's key advantages encompass streamlined model architecture, performance enhancement, and cross-domain generalizability. Over the pa…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: 60 figures, 116 pages, 500+ references