Search | arXiv e-print repository

VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization

Authors: Tao Liu, Ziyang Ma, Qi Chen, Feilong Chen, Shuai Fan, Xie Chen, Kai Yu

Abstract: We present VQTalker, a Vector Quantization-based framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units (phonemes) and corresponding visual articulations (visemes), which often share commonalities across languages. We introduce a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ), which creates a discretized representation of facial features. This method enables comprehensive capture of facial movements while improving generalization to multiple languages, even with limited training data. Building on this quantized representation, we implement a coarse-to-fine motion generation process that progressively refines facial animations. Extensive experiments demonstrate that VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios, particularly in multilingual settings. Notably, our method achieves high-quality results at a resolution of 512*512 pixels while maintaining a lower bitrate of approximately 11 kbps. Our work opens new possibilities for cross-lingual talking face generation. Synthetic results can be viewed at https://x-lance.github.io/VQTalker.

Submitted 18 December, 2024; v1 submitted 13 December, 2024; originally announced December 2024.

Comments: 14 pages
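
Note on the tokenizer named in the abstract above: the following is a minimal sketch of plain finite scalar quantization (FSQ), the building block that GRFSQ extends. It is not the authors' GRFSQ: the grouping and residual stages are omitted, and the level counts, the tanh bounding, and the names fsq_quantize and codes_to_index are illustrative assumptions.

```python
import math


def fsq_quantize(z, levels):
    """Round each coordinate of z to one of levels[i] integer values.

    Assumption for this sketch: every level count is odd, so the
    grid is symmetric around zero.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (num_levels - 1) // 2
        bounded = math.tanh(value) * half   # squash into (-half, half)
        codes.append(int(round(bounded)))   # snap to the nearest integer level
    return codes


def codes_to_index(codes, levels):
    """Pack per-dimension codes into a single token id (mixed-radix encoding)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index = index * num_levels + (code + half)  # shift code into 0..L-1
    return index


# Hypothetical 3-dimensional latent with 5 * 5 * 3 = 75 possible tokens.
levels = [5, 5, 3]
codes = fsq_quantize([0.3, -2.0, 0.0], levels)
token = codes_to_index(codes, levels)
```

Because the codebook is an implicit grid rather than a learned table, the number of tokens is simply the product of the level counts.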

arXiv:2412.09028 [pdf, other]

Learning and Current Prediction of PMSM Drive via Differential Neural Networks

Authors: Wenjie Mei, Xiaorui Wang, Yanrong Lu, Ke Yu, Shihua Li

Abstract: Learning models for dynamical systems in continuous time is significant for understanding complex phenomena and making accurate predictions. This study presents a novel approach utilizing differential neural networks (DNNs) to model nonlinear systems, specifically permanent magnet synchronous motors (PMSMs), and to predict their current trajectories. The efficacy of our approach is validated through experiments conducted under various load disturbances and no-load conditions. The results demonstrate that our method effectively and accurately reconstructs the original systems, showcasing strong short-term and long-term prediction capabilities and robustness. This study provides valuable insights into learning the inherent dynamics of complex dynamical data and holds potential for further applications in fields such as weather forecasting, robotics, and collective behavior analysis.

Submitted 12 December, 2024; originally announced December 2024.

arXiv:2412.08443 [pdf, other]

POINTS1.5: Building a Vision-Language Model towards Real World Applications

Authors: Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou

Abstract: Vision-language models have made significant strides recently, demonstrating superior performance across a range of tasks, e.g., optical character recognition and complex diagram analysis. Building on this trend, we introduce a new vision-language model, POINTS1.5, designed to excel in various real-world applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several key innovations: i) We replace the original CLIP vision encoder, which had a fixed image resolution, with a NaViT-style vision encoder that supports native dynamic high resolution. This allows POINTS1.5 to process images of any resolution without needing to split them into tiles. ii) We add bilingual support to POINTS1.5, significantly enhancing its capability in Chinese. Due to the scarcity of open-source Chinese datasets for vision-language models, we collect numerous images from the Internet and annotate them using a combination of manual and automatic methods. iii) We propose a set of rigorous filtering methods for visual instruction tuning datasets. We comprehensively evaluate all these filtering methods, and choose the most effective ones to obtain the final visual instruction tuning set. Thanks to these innovations, POINTS1.5 significantly outperforms POINTS1.0 and demonstrates strong performance across a range of real-world applications. Notably, POINTS1.5-7B is trained on fewer than 4 billion tokens and ranks first on the OpenCompass leaderboard among models with fewer than 10 billion parameters.

Submitted 11 December, 2024; originally announced December 2024.

arXiv:2412.08218 [pdf, other]

Maximal Clique Enumeration with Hybrid Branching and Early Termination

Authors: Kaixin Wang, Kaiqiang Yu, Cheng Long

Abstract: Maximal clique enumeration (MCE) is crucial for tasks like community detection and biological network analysis. Existing algorithms typically adopt the branch-and-bound framework with the vertex-oriented Bron-Kerbosch (BK) branching strategy, which forms the sub-branches by expanding the partial clique with a vertex. In this paper, we present a novel approach called HBBMC, a hybrid framework combining vertex-oriented BK branching and edge-oriented BK branching, where the latter adopts a branch-and-bound framework which forms the sub-branches by expanding the partial clique with an edge. This hybrid strategy enables more effective pruning and helps achieve a worst-case time complexity better than the best known one under a condition that holds for the majority of real-world graphs. To further enhance efficiency, we introduce an early termination technique, which leverages the topological information of the graphs and constructs the maximal cliques directly without branching. Our early termination technique is applicable to all branch-and-bound frameworks. Extensive experiments demonstrate the superior performance of our techniques.

Submitted 11 December, 2024; originally announced December 2024.

Comments: Accepted by ICDE'25
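
Note on the baseline named in the abstract above: the vertex-oriented Bron-Kerbosch strategy can be sketched as follows. This is the classic algorithm with pivoting, shown only to illustrate "expanding the partial clique with a vertex"; it is not HBBMC, and it includes neither edge-oriented branching nor the early termination technique. The function name and the adjacency-dict input format are assumptions made for this sketch.

```python
def bron_kerbosch(adj, r=None, p=None, x=None):
    """Yield every maximal clique of an undirected graph.

    adj maps each vertex to the set of its neighbors.
    r is the partial clique, p the candidate vertices that can still
    extend it, and x the vertices already explored (used to reject
    cliques that are not maximal).
    """
    if r is None:
        r, p, x = set(), set(adj), set()
    if not p and not x:
        yield frozenset(r)  # nothing can extend r, so it is maximal
        return
    # Pivot on the vertex with the most neighbors among the candidates;
    # neighbors of the pivot are skipped at this level because any clique
    # containing them is reached through the pivot or a non-neighbor.
    pivot = max(p | x, key=lambda v: len(adj[v] & p))
    for v in list(p - adj[pivot]):
        # Branch: expand the partial clique with the single vertex v.
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}
        x = x | {v}


# Small example graph: a triangle 1-2-3 with a pendant edge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cliques = set(bron_kerbosch(graph))
```

Each recursive call adds exactly one vertex to the partial clique, which is the branching step that an edge-oriented scheme would replace by adding two adjacent vertices at once.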

arXiv:2412.06217 [pdf, other]

doi 10.1002/adom.202403420

Large Bidirectional Refractive Index Change in Silicon-rich Nitride via Visible Light Trimming

Authors: Dmitrii Belogolovskii, Md Masudur Rahman, Karl Johnson, Vladimir Fedorov, Andrew Grieco, Nikola Alic, Abdoulaye Ndao, Paul K. L. Yu, Yeshaiahu Fainman

Abstract: Phase-sensitive integrated photonic devices are highly susceptible to minor manufacturing deviations, resulting in significant performance inconsistencies. This variability has limited the scalability and widespread adoption of these devices. Here, a major advancement is achieved through continuous-wave (CW) visible light (405 nm and 520 nm) trimming of plasma-enhanced chemical vapor deposition (PECVD) silicon-rich nitride (SRN) waveguides. The demonstrated method achieves precise, bidirectional refractive index tuning with a single laser source in CMOS-compatible SRN samples with refractive indices of 2.4 and 2.9 (measured at 1550 nm). By utilizing a cost-effective setup for real-time resonance tracking in micro-ring resonators, resonant wavelength shifts as fine as 10 pm are attained. Additionally, a record red shift of 49.1 nm and a substantial blue shift of 10.6 nm are demonstrated, corresponding to refractive index changes of approximately 0.11 and -0.02. The blue and red shifts are both conclusively attributed to thermal annealing. These results highlight SRN's exceptional capability for permanent optical tuning, establishing a foundation for stable, precisely controlled performance in phase-sensitive integrated photonic devices.

Submitted 15 February, 2025; v1 submitted 9 December, 2024; originally announced December 2024.

Comments: 23 pages, 11 figures. Replacement reason: Minor changes only to fix typos and improve clarity

arXiv:2412.05004 [pdf, other]

Prompt Transfer for Dual-Aspect Cross Domain Cognitive Diagnosis

Authors: Fei Liu, Yizhong Zhang, Shuochen Liu, Shengwei Ji, Kui Yu, Le Wu

Abstract: Cognitive Diagnosis (CD) aims to evaluate students' cognitive states based on their interaction data, enabling downstream applications such as exercise recommendation and personalized learning guidance. However, existing methods often struggle with accuracy drops in cross-domain cognitive diagnosis (CDCD), a practical yet challenging task. While some efforts have explored exercise-aspect CDCD, such as cross-subject scenarios, they fail to address the broader dual-aspect nature of CDCD, encompassing both student- and exercise-aspect variations. This diversity creates significant challenges in developing a scenario-agnostic framework. To address these gaps, we propose PromptCD, a simple yet effective framework that leverages soft prompt transfer for cognitive diagnosis. PromptCD is designed to adapt seamlessly across diverse CDCD scenarios, introducing PromptCD-S for student-aspect CDCD and PromptCD-E for exercise-aspect CDCD. Extensive experiments on real-world datasets demonstrate the robustness and effectiveness of PromptCD, consistently achieving superior performance across various CDCD scenarios. Our work offers a unified and generalizable approach to CDCD, advancing both theoretical and practical understanding in this critical domain. The implementation of our framework is publicly available at https://github.com/Publisher-PromptCD/PromptCD.

Submitted 6 December, 2024; originally announced December 2024.

arXiv:2412.04729 [pdf, other]

Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model

Authors: Keunwoo Peter Yu, Achal Dave, Rares Ambrus, Jean Mercat

Abstract: Recent advances in vision-language models (VLMs) have shown great promise in connecting images and text, but extending these models to long videos remains challenging due to the rapid growth in token counts. Models that compress videos by local aggregation in time or space have become popular for handling long-form inputs; however, these pooling-based projectors sacrifice the benefits of fixed-length representations that are crucial for streaming and efficient video understanding. We introduce $\texttt{Espresso}$, a new architecture that separately compresses spatial and temporal features into fixed-length sequences. $\texttt{Espresso}$ enables efficient video encoding while maintaining strong long-form reasoning capabilities. Experiments show that fixed-length compression combined with segment-wise processing offers a scalable and competitive alternative to pooling-based approaches. Our results demonstrate that fixed-length projectors, when properly designed and trained, remain a viable foundation for video-language modeling.

Submitted 16 May, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

Comments: 16 pages

arXiv:2412.04141 [pdf, ps, other]

Reducing Tool Hallucination via Reliability Alignment

Authors: Hongshen Xu, Zichen Zhu, Lei Pan, Zihan Wang, Su Zhu, Da Ma, Ruisheng Cao, Lu Chen, Kai Yu

Abstract: Large Language Models (LLMs) have expanded their capabilities beyond language generation to interact with external tools, enabling automation and real-world applications. However, tool hallucinations, where models either select inappropriate tools or misuse them, pose significant challenges, leading to erroneous task execution, increased computational costs, and reduced system reliability. To systematically address this issue, we define and categorize tool hallucinations into two main types, tool selection hallucination and tool usage hallucination. To evaluate and mitigate these issues, we introduce RelyToolBench, which integrates specialized test cases and novel metrics to assess hallucination-aware task success and efficiency. Finally, we propose Relign, a reliability alignment framework that expands the tool-use action space to include indecisive actions, allowing LLMs to defer tool use, seek clarification, or adjust tool selection dynamically. Through extensive experiments, we demonstrate that Relign significantly reduces tool hallucinations, improves task reliability, and enhances the efficiency of LLM tool interactions.

Submitted 29 May, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

arXiv:2412.02252 [pdf, other]

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

Authors: Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu

Abstract: The increasing context window size in Large Language Models (LLMs), such as the GPT and LLaMA series, has improved their ability to tackle complex, long-text tasks, but at the cost of inference efficiency, particularly regarding memory and computational complexity. Existing methods, including selective token retention and window-based attention, improve efficiency but risk discarding important tokens needed for future text generation. In this paper, we propose an approach that enhances LLM efficiency without token loss by reducing the memory and computational load of less important tokens, rather than discarding them. We address two challenges: 1) investigating the distribution of important tokens in the context, discovering that recent tokens are more important than distant tokens in context, and 2) optimizing resources for distant tokens by sharing attention scores across layers. The experiments show that our method saves $35\%$ KV cache without compromising the performance.

Submitted 3 December, 2024; originally announced December 2024.

Comments: preprint

arXiv:2412.00645 [pdf, ps, other]

Quantum Convolutional Neural Network with Flexible Stride

Authors: Kai Yu, Song Lin, Bin-Bin Cai

Abstract: The convolutional neural network is a crucial tool for machine learning, especially in the field of computer vision. Its unique structure and characteristics provide significant advantages in feature extraction. However, with the exponential growth of data scale, classical computing architectures face serious challenges in terms of time efficiency and memory requirements. In this paper, we propose a novel quantum convolutional neural network algorithm. It can flexibly adjust the stride to accommodate different tasks while ensuring that the required qubits do not increase proportionally with the size of the sliding window. First, a data loading method based on quantum superposition is presented, which is able to exponentially reduce space requirements. Subsequently, quantum subroutines for convolutional layers, pooling layers, and fully connected layers are designed, fully replicating the core functions of classical convolutional neural networks. Among them, the quantum arithmetic technique is introduced to recover the data position information of the corresponding receptive field through the position information of the feature, which makes the selection of step size more flexible. Moreover, parallel quantum amplitude estimation and swap test techniques are employed, enabling parallel feature extraction. Analysis shows that the method can achieve exponential acceleration with respect to data scale while using less memory than its classical counterpart. Finally, the proposed method is numerically simulated on the Qiskit framework using handwritten digit images in the MNIST dataset. The experimental results provide evidence for the effectiveness of the model.

Submitted 30 November, 2024; originally announced December 2024.
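
Note on the swap test mentioned in the abstract above: for two normalized states |a> and |b>, the standard swap test measures its ancilla qubit as 0 with probability 1/2 + |<a|b>|^2 / 2, so the squared overlap can be estimated from repeated measurements. The sketch below evaluates that formula classically from amplitude lists; it does not simulate a circuit and is not the paper's parallel procedure. The function name and the list-of-amplitudes input are assumptions made for illustration.

```python
import math


def swap_test_p0(a, b):
    """Probability that the swap-test ancilla is measured as 0.

    a and b are amplitude lists of equal length; they are normalized
    here so callers may pass unnormalized vectors.
    """
    if len(a) != len(b):
        raise ValueError("states must have the same dimension")
    norm_a = math.sqrt(sum(abs(c) ** 2 for c in a))
    norm_b = math.sqrt(sum(abs(c) ** 2 for c in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("states must be non-zero")
    # Inner product <a|b>, conjugating the first argument.
    overlap = sum(complex(x).conjugate() * complex(y) for x, y in zip(a, b))
    overlap /= norm_a * norm_b
    return 0.5 + 0.5 * abs(overlap) ** 2


p_same = swap_test_p0([1, 0], [1, 0])        # identical states
p_orthogonal = swap_test_p0([1, 0], [0, 1])  # orthogonal states
```

Identical states give the maximum probability and orthogonal states give the minimum, which is why the measured frequency of 0 outcomes serves as a similarity score between the two encoded vectors.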

arXiv:2411.15424 [pdf]

doi 10.1016/j.molliq.2024.126304

Discrepancy in Oil Displacement Mechanisms at the Equivalent Interfacial Tensions: Differentiating Contributions from Surfactant and Nanoparticles on Interfacial Activities

Authors: Suparit Tangparitkul, Thakheru Akamine, David Harbottle, Falan Srisuriyachai, Kai Yu

Abstract: This study examines discrepancies in oil displacement mechanisms at equivalent interfacial tensions, focusing on the distinct contributions of surfactants and nanoparticles. It was hypothesized that similar interfacial activities would result in consistent displacement outcomes, while differences would reflect unique interfacial behaviors. Micromodel experiments revealed that at high interfacial tension (~20 mN/m), surfactants outperformed nanofluids in efficiency and ultimate oil recovery by reinforcing capillary forces. Conversely, nanofluids showed limited ability to modify interfacial forces. At lower interfacial tensions (6.5 mN/m for surfactants, 15.6 mN/m for nanofluids), both systems displayed similar displacement efficiencies and fingering patterns, driven by distinct mechanisms: capillary instability for surfactants and expansive layer flow for nanofluids. These findings challenge the assumption that nanofluids rely primarily on interfacial tension reduction for enhanced oil recovery (EOR) and highlight the need to refine our understanding of nanoparticle interfacial activities. Future studies should extend these insights to core-scale experiments for a more comprehensive evaluation of two-phase flow dynamics.

Submitted 22 November, 2024; originally announced November 2024.

Comments: 19 pages

arXiv:2411.14347 [pdf, other]

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Authors: Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, Xingyu Chen, Zhuheng Song, Yuhong Zhang, Hongjie Huang, Han Gao, Shilong Liu, Hao Zhang, Feng Li, Kent Yu, Lei Zhang

Abstract: In this paper, we introduce DINO-X, which is a unified object-centric vision model developed by IDEA Research with the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easy, DINO-X extends its input options to support text prompts, visual prompts, and customized prompts. With such flexible prompt options, we develop a universal object prompt to support prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt. To enhance the model's core grounding capability, we have constructed a large-scale dataset with over 100 million high-quality grounding samples, referred to as Grounding-100M, for advancing the model's open-vocabulary detection performance. Pre-training on such a large-scale grounding dataset leads to a foundational object-level representation, which enables DINO-X to integrate multiple perception heads to simultaneously support multiple object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, object-based QA, etc. Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val benchmarks, improving the previous SOTA performance by 5.8 AP and 5.0 AP. Such a result underscores its significantly improved capacity for recognizing long-tailed objects.

Submitted 15 May, 2025; v1 submitted 21 November, 2024; originally announced November 2024.

Comments: Technical Report

arXiv:2411.13914 [pdf, other]

ICODE: Modeling Dynamical Systems with Extrinsic Input Information

Authors: Zhaoyi Li, Wenjie Mei, Ke Yu, Yang Bai, Shihua Li

Abstract: Learning models of dynamical systems with external inputs, which may be, for example, nonsmooth or piecewise, is crucial for studying complex phenomena and predicting future state evolution, which is essential for applications such as safety guarantees and decision-making. In this work, we introduce \emph{Input Concomitant Neural ODEs (ICODEs)}, which incorporate precise real-time input information into the learning process of the models, rather than treating the inputs as hidden parameters to be learned. The sufficient conditions to ensure the model's contraction property are provided to guarantee that system trajectories of the trained model converge to a fixed point, regardless of initial conditions across different training processes. We validate our method through experiments on several representative real dynamics: Single-link robot, DC-to-DC converter, motion dynamics of a rigid body, Rabinovich-Fabrikant equation, Glycolytic-glycogenolytic pathway model, and heat conduction equation. The experimental results demonstrate that our proposed ICODEs efficiently learn the ground truth systems, achieving superior prediction performance under both typical and atypical inputs. This work offers a valuable class of neural ODE models for understanding physical systems with explicit external input information, with potentially promising applications in fields such as physics and robotics. Our code is available online at https://github.com/EEE-ai59/ICODE.git.

Submitted 15 April, 2025; v1 submitted 21 November, 2024; originally announced November 2024.

Comments: To be published in IEEE Transactions on Automation Science and Engineering

arXiv:2411.09371 [pdf, other]

DSCformer: A Dual-Branch Network Integrating Enhanced Dynamic Snake Convolution and SegFormer for Crack Segmentation

Authors: Kaiwei Yu, I-Ming Chen, Jing Wu

Abstract: In construction quality monitoring, accurately detecting and segmenting cracks in concrete structures is paramount for safety and maintenance. Current convolutional neural networks (CNNs) have demonstrated strong performance in crack segmentation tasks, yet they often struggle with complex backgrounds and fail to capture fine-grained tubular structures fully. In contrast, Transformers excel at capturing global context but lack precision in detailed feature extraction. To address these challenges, we introduce DSCformer, a novel hybrid model that integrates an enhanced Dynamic Snake Convolution (DSConv) with a Transformer architecture for crack segmentation. Our key contributions include the enhanced DSConv through a pyramid kernel for adaptive offset computation and a simultaneous bi-directional learnable offset iteration, significantly improving the model's ability to capture intricate crack patterns. Additionally, we propose a Weighted Convolutional Attention Module (WCAM), which refines channel attention, allowing for more precise and adaptive feature attention. We evaluate DSCformer on the Crack3238 and FIND datasets, achieving IoUs of 59.22\% and 87.24\%, respectively. The experimental results suggest that our DSCformer outperforms state-of-the-art methods across different datasets.

Submitted 14 November, 2024; originally announced November 2024.

arXiv:2411.04142 [pdf, other]

Unified Pathological Speech Analysis with Prompt Tuning

Authors: Fei Yang, Xuenan Xu, Mengyue Wu, Kai Yu

Abstract: Pathological speech analysis is used to detect diseases such as depression and Alzheimer's disease and has attracted much interest from researchers. However, previous pathological speech analysis models are commonly designed for a specific disease while overlooking the connection between diseases, which may constrain performance and lower training efficiency. Instead of fine-tuning deep models for different tasks, prompt tuning is a much more efficient training paradigm. We thus propose a unified pathological speech analysis system for as many as three diseases with the prompt tuning technique. This system uses prompt tuning to adjust only a small part of the parameters to detect different diseases from the speech of potential patients. Our system leverages a pre-trained spoken language model and demonstrates strong performance across multiple disorders while only fine-tuning a fraction of the parameters. This efficient training approach leads to faster convergence and improved F1 scores by allowing knowledge to be shared across tasks. Our experiments on Alzheimer's disease, Depression, and Parkinson's disease show competitive results, highlighting the effectiveness of our method in pathological speech analysis.

Submitted 5 November, 2024; originally announced November 2024.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2410.21951 [pdf, other]

Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding

Authors: Bohan Li, Hankun Wang, Situo Zhang, Yiwei Guo, Kai Yu

Abstract: The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.

Submitted 9 February, 2025; v1 submitted 29 October, 2024; originally announced October 2024.

Comments: Accepted by ICASSP 2025

MSC Class: 68T07

arXiv:2410.21312 [pdf, other]

$\texttt{PatentAgent}$: Intelligent Agent for Automated Pharmaceutical Patent Analysis

Authors: Xin Wang, Yifan Zhang, Xiaojing Zhang, Longhui Yu, Xinna Lin, Jindong Jiang, Bin Ma, Kaicheng Yu

Abstract: Pharmaceutical patents play a vital role in biochemical industries, especially in drug discovery, providing researchers with unique early access to data, experimental results, and research insights. With the advancement of machine learning, patent analysis has evolved from manual labor to tasks assisted by automatic tools. However, there is still no unified agent that assists every aspect of patent analysis, from patent reading to core chemical identification. Leveraging the capabilities of Large Language Models (LLMs) to understand requests and follow instructions, we introduce the $\textbf{first}$ intelligent agent in this domain, $\texttt{PatentAgent}$, poised to advance and potentially revolutionize the landscape of pharmaceutical research. $\texttt{PatentAgent}$ comprises three key end-to-end modules -- $\textit{PA-QA}$, $\textit{PA-Img2Mol}$, and $\textit{PA-CoreId}$ -- that respectively perform (1) patent question-answering, (2) image-to-molecular-structure conversion, and (3) core chemical structure identification, addressing the essential needs of scientists and practitioners in pharmaceutical patent analysis. Each module of $\texttt{PatentAgent}$ demonstrates significant effectiveness with the updated algorithm and the synergistic design of the $\texttt{PatentAgent}$ framework. $\textit{PA-Img2Mol}$ outperforms existing methods across CLEF, JPO, UOB, and USPTO patent benchmarks with an accuracy gain between 2.46% and 8.37%, while $\textit{PA-CoreId}$ realizes accuracy improvement ranging from 7.15% to 7.62% on the PatentNetML benchmark. Our code and dataset will be publicly available.

Submitted 25 October, 2024; originally announced October 2024.

Comments: 7 pages

arXiv:2410.18908 [pdf, other]

A Survey on Speech Large Language Models

Authors: Jing Peng, Yucheng Wang, Yangui Fang, Yu Xi, Xu Li, Xizhuo Zhang, Kai Yu

Abstract: Large Language Models (LLMs) exhibit strong contextual understanding and remarkable multitask performance. As a result, researchers have been actively exploring the integration of LLMs into the domain of speech understanding, with a primary focus on a broad range of speech-to-text tasks. These include automatic speech recognition (ASR), speech-to-text translation (ST), speech emotion recognition (SER), and others. We refer to such models as Speech LLMs, which are typically built on a unified architecture that follows the pipeline of Audio Feature Extraction -> Multimodal Information Fusion -> LLM Inference. This approach enables richer audio feature extraction while facilitating end-to-end fusion of audio and text modalities, thereby achieving deeper understanding and reasoning from audio data. This paper elucidates the development of Speech LLMs, offering an in-depth analysis of system architectures. Through extensive research and a series of targeted experiments, the paper assesses the advancements in Speech LLMs and their potential for cross-task integration within the speech understanding field. Furthermore, it highlights key challenges identified through experimentation, such as the dormancy of LLMs under certain conditions. The paper further explores training strategies for Speech LLMs, proposes potential solutions based on these findings, and offers valuable insights and references for future research.

Submitted 26 May, 2025; v1 submitted 24 October, 2024; originally announced October 2024.

Comments: This version has been updated to incorporate recent work in the field and includes revised illustrations and textual descriptions

arXiv:2410.18558 [pdf, other]

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Authors: Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, Zhenchong Hu, Bo-Wen Zhang, Jijie Li, Dong Liang, Yingli Zhao, Songjing Wang, Yulong Ao, Yiming Ju, Huanhuan Ma, Xiaotong Li, Haiwen Diao, Yufeng Cui, Xinlong Wang, Yaoqi Liu, Fangxiang Feng, et al. (1 additional author not shown)

Abstract: Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data serves as the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, limitations in the scale and quality of open-source instruction data hinder the performance of VLMs trained on these datasets, leading to a significant gap compared to models trained on closed-source data. To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset. We collected the available multimodal instruction datasets and performed unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. Furthermore, to enable large-scale expansion of instruction data and support the continuous acquisition of high-quality data, we propose a synthetic instruction generation method based on a tagging system and open-source VLMs. By establishing correspondences between different types of images and associated instruction types, this method can provide essential guidance during data synthesis. Leveraging this high-quality data, we have trained a 2-billion-parameter Vision-Language Model, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of similar scale. The data is available at: https://huggingface.co/datasets/BAAI/Infinity-MM.

Submitted 6 January, 2025; v1 submitted 24 October, 2024; originally announced October 2024.

arXiv:2410.16805 [pdf, other]

Test-time Adversarial Defense with Opposite Adversarial Path and High Attack Time Cost

Authors: Cheng-Han Yeh, Kuanchun Yu, Chun-Shien Lu

Abstract: Deep learning models are known to be vulnerable to adversarial attacks that inject sophisticatedly designed perturbations into input data. Training-time defenses still exhibit a significant performance gap between natural accuracy and robust accuracy. In this paper, we investigate a new test-time adversarial defense method via diffusion-based recovery along opposite adversarial paths (OAPs). We present a purifier that can be plugged into a pre-trained model to resist adversarial attacks. Unlike prior art, the key idea is excessive denoising or purification by integrating the opposite adversarial direction with reverse diffusion to push the input image further toward the opposite adversarial direction. For the first time, we also exemplify the pitfall of conducting AutoAttack (Rand) for diffusion-based defense methods. Through the lens of time complexity, we examine the trade-off between the effectiveness of adaptive attack and its computation complexity against our defense. Experimental evaluation along with time cost analysis verifies the effectiveness of the proposed method.

Submitted 19 May, 2025; v1 submitted 22 October, 2024; originally announced October 2024.

arXiv:2410.16286 [pdf, other]

Solution for Point Tracking Task of ECCV 2nd Perception Test Challenge 2024

Authors: Yuxuan Zhang, Pengsong Niu, Kun Yu, Qingguo Chen, Yang Yang

Abstract: This report introduces an improved method for the Tracking Any Point (TAP), focusing on monitoring physical surfaces in video footage. Despite their success with short-sequence scenarios, TAP methods still face performance degradation and resource overhead in long-sequence situations. To address these issues, we propose a simple yet effective approach called Fine-grained Point Discrimination (FPD), which focuses on perceiving and rectifying point tracking at multiple granularities in zero-shot manner, especially for static points in the videos shot by a static camera. The proposed FPD contains two key components: (1) Multi-granularity point perception, which can detect static sequences in video and points. (2) Dynamic trajectory correction, which replaces point trajectories based on the type of tracked point. Our approach achieved the second highest score in the final test with a score of 0.4720.

Submitted 5 October, 2024; originally announced October 2024.

arXiv:2410.15764 [pdf, other]

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

Authors: Yiwei Guo, Zhihan Li, Chenpeng Du, Hankun Wang, Xie Chen, Kai Yu

Abstract: Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec that has both low bitrate and speaker decoupling ability. LSCodec adopts a multi-stage unsupervised training framework with a speaker perturbation technique. A continuous information bottleneck is first established, followed by vector quantization that produces a discrete speaker-decoupled space. A discrete token vocoder finally refines acoustic details from LSCodec. By reconstruction evaluations, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and smaller vocabulary size than baselines. Voice conversion and speaker probing experiments prove the excellent speaker disentanglement of LSCodec, and ablation study verifies the effectiveness of the proposed training framework.

Submitted 21 May, 2025; v1 submitted 21 October, 2024; originally announced October 2024.

Comments: 5 pages, 2 figures, 3 tables. Demo page: https://cantabile-kwok.github.io/LSCodec/. Accepted to Interspeech 2025

arXiv:2410.15648 [pdf, other]

Linking Model Intervention to Causal Interpretation in Model Explanation

Authors: Debo Cheng, Ziqi Xu, Jiuyong Li, Lin Liu, Kui Yu, Thuc Duy Le, Jixue Liu

Abstract: Intervention intuition is often used in model explanation where the intervention effect of a feature on the outcome is quantified by the difference of a model prediction when the feature value is changed from the current value to the baseline value. Such a model intervention effect of a feature is inherently an association. In this paper, we will study the conditions when an intuitive model intervention effect has a causal interpretation, i.e., when it indicates whether a feature is a direct cause of the outcome. This work links the model intervention effect to the causal interpretation of a model. Such an interpretation capability is important since it indicates whether a machine learning model is trustworthy to domain experts. The conditions also reveal the limitations of using a model intervention effect for causal interpretation in an environment with unobserved features. Experiments on semi-synthetic datasets have been conducted to validate theorems and show the potential for using the model intervention effect for model interpretation.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15621 [pdf, other]

DRIM-ANN: An Approximate Nearest Neighbor Search Engine based on Commercial DRAM-PIMs

Authors: Mingkai Chen, Tianhua Han, Cheng Liu, Shengwen Liang, Kuai Yu, Lei Dai, Ziming Yuan, Ying Wang, Lei Zhang, Huawei Li, Xiaowei Li

Abstract: Approximate Nearest Neighbor Search (ANNS), which enables efficient semantic similarity search in large datasets, has become a fundamental component of critical applications such as information retrieval and retrieval-augmented generation (RAG). However, ANNS is a well-known I/O-intensive algorithm with a low compute-to-I/O ratio, often requiring massive storage due to the large volume of high-dimensional data. This leads to I/O bottlenecks on CPUs and memory limitations on GPUs. DRAM-based Processing-in-Memory (DRAM-PIM) architecture, which offers high bandwidth, large-capacity memory, and the ability to perform efficient computation in or near the data, presents a promising solution for ANNS. In this work, we investigate the use of commercial DRAM-PIM for ANNS for the first time and propose DRIM-ANN, an optimized ANNS engine based on DRAM-PIMs from UPMEM. Notably, given that the target DRAM-PIM exhibits an even lower compute-to-I/O ratio than basic ANNS, we leverage lookup tables (LUTs) to replace more multiplications with I/O operations. We then systematically tune ANNS to search optimized configurations with lower computational load, aligning the compute-to-I/O ratio of ANNS with that of DRAM-PIMs while maintaining accuracy constraints. Building on this tuned ANNS algorithm, we further explore implementation optimizations to fully utilize the two thousand parallel processing units with private local memory in DRAM-PIMs. To address the load imbalance caused by ANNS requests distributed across different clusters of large datasets, we propose a load-balancing strategy that combines static data layout optimization with dynamic runtime request scheduling. Experimental results on representative datasets show that DRIM-ANN achieves an average performance speedup of 2.92x compared to a 32-thread CPU counterpart.

Submitted 20 October, 2024; originally announced October 2024.

arXiv:2410.14357 [pdf, other]

Efficient charge-preserving excited state preparation with variational quantum algorithms

Authors: Zohim Chandani, Kazuki Ikeda, Zhong-Bo Kang, Dmitri E. Kharzeev, Alexander McCaskey, Andrea Palermo, C. R. Ramakrishnan, Pooja Rao, Ranjani G. Sundaram, Kwangmin Yu

Abstract: Determining the spectrum and wave functions of excited states of a system is crucial in quantum physics and chemistry. Low-depth quantum algorithms, such as the Variational Quantum Eigensolver (VQE) and its variants, can be used to determine the ground-state energy. However, current approaches to computing excited states require numerous controlled unitaries, making the application of the original Variational Quantum Deflation (VQD) algorithm to problems in chemistry or physics suboptimal. In this study, we introduce a charge-preserving VQD (CPVQD) algorithm, designed to incorporate symmetry and the corresponding conserved charge into the VQD framework. This results in dimension reduction, significantly enhancing the efficiency of excited-state computations. We present benchmark results with GPU-accelerated simulations using systems up to 24 qubits, showcasing applications in high-energy physics, nuclear physics, and quantum chemistry. This work is performed on NERSC's Perlmutter system using NVIDIA's open-source platform for accelerated quantum supercomputing - CUDA-Q.

Submitted 18 October, 2024; originally announced October 2024.

Comments: 20 pages, 6 figures, 1 table

arXiv:2410.13757 [pdf, other]

MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation

Authors: Zichen Zhu, Hao Tang, Yansi Li, Dingye Liu, Hongshen Xu, Kunyao Lan, Danyang Zhang, Yixuan Jiang, Hao Zhou, Chenrun Wang, Situo Zhang, Liangtai Sun, Yixiao Wang, Yuheng Sun, Lu Chen, Kai Yu

Abstract: Existing Multimodal Large Language Model (MLLM)-based agents face significant challenges in handling complex GUI (Graphical User Interface) interactions on devices. These challenges arise from the dynamic and structured nature of GUI environments, which integrate text, images, and spatial relationships, as well as the variability in action spaces across different pages and tasks. To address these limitations, we propose MobA, a novel MLLM-based mobile assistant system. MobA introduces an adaptive planning module that incorporates a reflection mechanism for error recovery and dynamically adjusts plans to align with the real environment contexts and action module's execution capacity. Additionally, a multifaceted memory module provides comprehensive memory support to enhance adaptability and efficiency. We also present MobBench, a dataset designed for complex mobile interactions. Experimental results on MobBench and AndroidArena demonstrate MobA's ability to handle dynamic GUI environments and perform complex mobile tasks.

Submitted 13 May, 2025; v1 submitted 17 October, 2024; originally announced October 2024.

Comments: NAACL 2025 Demo Track [code] https://github.com/OpenDFM/MobA [dataset] https://huggingface.co/datasets/OpenDFM/MobA-MobBench

arXiv:2410.12205 [pdf]

Challenges in Adopting Companion Robots: An Exploratory Study of Robotic Companionship Conducted with Chinese Retirees

Authors: Mengyang Wang, Keye Yu, Yukai Zhang, Mingming Fan

Abstract: Companion robots hold immense potential in providing emotional support to older adults in the rapidly aging world. However, questions have been raised regarding whether having a robotic companion benefits healthy older adults, how they perceive the value of companion robots, and what their relationship with companion robots would be like. To understand healthy older adults' perceptions, attitudes, and relationships toward companion robots, we conducted multiple focus groups with eighteen retirees. Our findings underscore the social context encountered by older adults in China and reveal the mismatch between the current value proposition of companion robots and healthy older adults' needs. We further identify factors influencing the adoption of robotic companionship, which include individuals' self-disclosure tendencies, quality of companionship, differentiated value, and seamless collaboration with aging-in-community infrastructure and services.

Submitted 15 October, 2024; originally announced October 2024.

arXiv:2410.11718 [pdf, other]

Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models

Authors: Hongchuan Zeng, Senyu Han, Lu Chen, Kai Yu

Abstract: Large language models (LLMs) have demonstrated remarkable performance, particularly in multilingual contexts. While recent studies suggest that LLMs can transfer skills learned in one language to others, the internal mechanisms behind this ability remain unclear. We observed that the neuron activation patterns of LLMs exhibit similarities when processing the same language, revealing the existence and location of key linguistic regions. Additionally, we found that neuron activation patterns are similar when processing sentences with the same semantic meaning in different languages. This indicates that LLMs map semantically identical inputs from different languages into a "Lingua Franca", a common semantic latent space that allows for consistent processing across languages. This semantic alignment becomes more pronounced with training and increased model size, resulting in a more language-agnostic activation pattern. Moreover, we found that key linguistic neurons are concentrated in the first and last layers of LLMs, becoming denser in the first layers as training progresses. Experiments on BLOOM and LLaMA2 support these findings, highlighting the structural evolution of multilingual LLMs during training and scaling up. This paper provides insights into the internal workings of LLMs, offering a foundation for future improvements in their cross-lingual capabilities.

Submitted 28 February, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

Comments: 16 pages, 11 figures, 4 tables

arXiv:2410.10158 [pdf, other]

Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism

Authors: Kihyun Yu, Duksang Lee, William Overman, Dabeen Lee

Abstract: This paper studies the safe reinforcement learning problem formulated as an episodic finite-horizon tabular constrained Markov decision process with an unknown transition kernel and stochastic reward and cost functions. We propose a model-based algorithm based on novel cost and reward function estimators that provide tighter cost pessimism and reward optimism. While guaranteeing no constraint violation in every episode, our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar C - \bar C_b)^{-1}H^{2.5} S\sqrt{AK})$ where $\bar C$ is the cost budget for an episode, $\bar C_b$ is the expected cost under a safe baseline policy over an episode, $H$ is the horizon, and $S$, $A$ and $K$ are the number of states, actions, and episodes, respectively. This improves upon the best-known regret upper bound, and when $\bar C - \bar C_b = \Omega(H)$, it nearly matches the regret lower bound of $\Omega(H^{1.5}\sqrt{SAK})$. We deduce our cost and reward function estimators via a Bellman-type law of total variance to obtain tight bounds on the expected sum of the variances of value function estimates. This leads to a tighter dependence on the horizon in the function estimators. We also present numerical results to demonstrate the computational effectiveness of our proposed framework.

Submitted 14 October, 2024; originally announced October 2024.

arXiv:2410.09503 [pdf, other]

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

Authors: Wenxi Chen, Ziyang Ma, Xiquan Li, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Kai Yu, Xie Chen

Abstract: Automated Audio Captioning (AAC) aims to generate natural textual descriptions for input audio signals. Recent progress in audio pre-trained models and large language models (LLMs) has significantly enhanced audio understanding and textual reasoning capabilities, making improvements in AAC possible. In this paper, we propose SLAM-AAC to further enhance AAC with paraphrasing augmentation and CLAP-Refine through LLMs. Our approach uses the self-supervised EAT model to extract fine-grained audio representations, which are then aligned with textual embeddings via lightweight linear layers. The caption generation LLM is efficiently fine-tuned using the LoRA adapter. Drawing inspiration from the back-translation method in machine translation, we implement paraphrasing augmentation to expand the Clotho dataset during pre-training. This strategy helps alleviate the limitation of scarce audio-text pairs and generates more diverse captions from a small set of audio clips. During inference, we introduce the plug-and-play CLAP-Refine strategy to fully exploit multiple decoding outputs, akin to the n-best rescoring strategy in speech recognition. Using the CLAP model for audio-text similarity calculation, we could select the textual descriptions generated by multiple searching beams that best match the input audio. Experimental results show that SLAM-AAC achieves state-of-the-art performance on Clotho V2 and AudioCaps, surpassing previous mainstream models.

Submitted 12 October, 2024; originally announced October 2024.

arXiv:2410.08565 [pdf, other]

Baichuan-Omni Technical Report

Authors: Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Tao Zhang, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu, et al. (2 additional authors not shown)

Abstract: The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with a 7B model and proceeding through two stages of multimodal alignment and multitask fine-tuning across the audio, image, video, and text modalities. This approach equips the language model with the ability to handle visual and audio data effectively. Demonstrating strong performance across various omni-modal and multimodal benchmarks, we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.

Submitted 27 December, 2024; v1 submitted 11 October, 2024; originally announced October 2024.

arXiv:2410.07675 [pdf, other]

Adversarial Robustness Overestimation and Instability in TRADES

Authors: Jonathan Weiping Li, Ren-Wei Liang, Cheng-Han Yeh, Cheng-Chang Tsai, Kuanchun Yu, Chun-Shien Lu, Shang-Tse Chen

Abstract: This paper examines the phenomenon of probabilistic robustness overestimation in TRADES, a prominent adversarial training method. Our study reveals that TRADES sometimes yields disproportionately high PGD validation accuracy compared to the AutoAttack testing accuracy in the multiclass classification task. This discrepancy highlights a significant overestimation of robustness for these instances, potentially linked to gradient masking. We further analyze the parameters contributing to unstable models that lead to overestimation. Our findings indicate that smaller batch sizes, lower beta values (which control the weight of the robust loss term in TRADES), larger learning rates, and higher class complexity (e.g., CIFAR-100 versus CIFAR-10) are associated with an increased likelihood of robustness overestimation. By examining metrics such as the First-Order Stationary Condition (FOSC), inner-maximization, and gradient information, we identify the underlying cause of this phenomenon as gradient masking and provide insights into it. Furthermore, our experiments show that certain unstable training instances may return to a state without robust overestimation, inspiring our attempts at a solution. In addition to adjusting parameter settings to reduce instability or retraining when overestimation occurs, we recommend incorporating Gaussian noise in inputs when the FOSC score exceeds the threshold. This method aims to mitigate robustness overestimation of TRADES and other similar methods at its source, ensuring more reliable representation of adversarial robustness during evaluation.

Submitted 10 October, 2024; originally announced October 2024.

arXiv:2410.06885 [pdf, other]

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Authors: Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, Xie Chen

Abstract: This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our F5-TTS exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. We have released all codes and checkpoints to promote community development, at https://SWivid.github.io/F5-TTS/.

Submitted 20 May, 2025; v1 submitted 9 October, 2024; originally announced October 2024.

Comments: 17 pages, 9 tables, 3 figures

arXiv:2410.06519 [pdf, other]

SEGMENT+: Long Text Processing with Short-Context Language Models

Authors: Wei Shi, Shuang Li, Kerun Yu, Jinglei Chen, Zujie Liang, Xinhui Wu, Yuxi Qian, Feng Wei, Bo Zheng, Jiaqing Liang, Jiangjie Chen, Yanghua Xiao

Abstract: There is a growing interest in expanding the input capacity of language models (LMs) across various domains. However, simply increasing the context window does not guarantee robust performance across diverse long-input processing tasks, such as understanding extensive documents and extracting detailed information from lengthy and noisy data. In response, we introduce SEGMENT+, a general framework that enables LMs to handle extended inputs within limited context windows efficiently. SEGMENT+ utilizes structured notes and a filtering module to manage information flow, resulting in a system that is both controllable and interpretable. Our extensive experiments across various model sizes, focusing on long-document question-answering and Needle-in-a-Haystack tasks, demonstrate the effectiveness of SEGMENT+ in improving performance.

Submitted 8 October, 2024; originally announced October 2024.

Comments: EMNLP 2024

arXiv:2410.04652 [pdf, other]

Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI

Authors: Chengyuan Xu, Radha Kumaran, Noah Stier, Kangyou Yu, Tobias Höllerer

Abstract: Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment. AR research has long focused on the potential of context awareness, demonstrating novel capabilities that leverage the semantics in the 3D environment for various object-level interactions. Meanwhile, the computer vision community has made leaps in neural vision-language understanding to enhance environment perception for autonomous tasks. In this work, we introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation, enabling user-guided machine learning involving physical objects. We first present a fast multimodal 3D reconstruction pipeline that brings linguistic understanding to AR by fusing CLIP vision-language features into the environment and object models. We then propose "in-situ" machine learning, which, in conjunction with the multimodal representation, enables new tools and interfaces for users to interact with physical spaces and objects in a spatially and linguistically meaningful manner. We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time. We also make our full implementation and demo data available at (https://github.com/cy-xu/spatially_aware_AI) to encourage further exploration and research in spatially aware AI.

Submitted 6 October, 2024; originally announced October 2024.

Comments: 10 pages, 6 figures, accepted to IEEE ISMAR 2024

ACM Class: I.4.8; H.5.2

arXiv:2410.03733 [pdf, other]

Evaluating the Effects of AI Directors for Quest Selection

Authors: Kristen K. Yu, Matthew Guzdial, Nathan Sturtevant

Abstract: Modern commercial games are designed for mass appeal, not for individual players, but there is a unique opportunity in video games to better fit the individual through adapting game elements. In this paper, we focus on AI Directors, systems which can dynamically modify a game, that personalize the player experience to match the player's preference. In the past, some AI Director studies have provided inconclusive results, so their effect on player experience is not clear. We take three AI Directors and directly compare them in a human subject study to test their effectiveness on quest selection. Our results show that a non-random AI Director provides a better player experience than a random AI Director.

Submitted 30 September, 2024; originally announced October 2024.

arXiv:2410.01585 [pdf]

Avatar Appearance and Behavior of Potential Harassers Affect Users' Perceptions and Response Strategies in Social Virtual Reality (VR): A Mixed-Methods Study

Authors: Xuetong Wang, Ziyan Wang, Mingmin Zhang, Kangyou Yu, Pan Hui, Mingming Fan

Abstract: Sexual harassment has been recognized as a significant social issue. In recent years, the emergence of harassment in social virtual reality (VR) has become an important and urgent research topic. We employed a mixed-methods approach by conducting online surveys with VR users (N = 166) and semi-structured interviews with social VR users (N = 18) to investigate how users perceive sexual harassment in social VR, focusing on the influence of avatar appearance. Moreover, we derived users' response strategies to sexual harassment and gained insights on platform regulation. This study contributes to the research on sexual harassment in social VR by examining the moderating effect of avatar appearance on user perception of sexual harassment and uncovering the underlying reasons behind response strategies. Moreover, it presents novel prospects and challenges in platform design and regulation domains.

Submitted 14 October, 2024; v1 submitted 2 October, 2024; originally announced October 2024.

arXiv:2410.00409 [pdf, other]

AlignSum: Data Pyramid Hierarchical Fine-tuning for Aligning with Human Summarization Preference

Authors: Yang Han, Yiming Wang, Rui Wang, Lu Chen, Kai Yu

Abstract: Text summarization tasks commonly employ Pre-trained Language Models (PLMs) to fit diverse standard datasets. While these PLMs excel in automatic evaluations, they frequently underperform in human evaluations, indicating a deviation between their generated summaries and human summarization preferences. This discrepancy is likely due to the low quality of fine-tuning datasets and the limited availability of high-quality human-annotated data that reflect true human preference. To address this challenge, we introduce a novel human summarization preference alignment framework AlignSum. This framework consists of three parts: Firstly, we construct a Data Pyramid with extractive, abstractive, and human-annotated summary data. Secondly, we conduct the Gaussian Resampling to remove summaries with extreme lengths. Finally, we implement the two-stage hierarchical fine-tuning with Data Pyramid after Gaussian Resampling. We apply AlignSum to PLMs on the human-annotated CNN/DailyMail and BBC XSum datasets. Experiments show that with AlignSum, PLMs like BART-Large surpass 175B GPT-3 in both automatic and human evaluations. This demonstrates that AlignSum significantly enhances the alignment of language models with human summarization preferences.

Submitted 1 October, 2024; originally announced October 2024.

Comments: EMNLP2024 Findings, code at: https://github.com/csyanghan/AlignSum

arXiv:2409.19894 [pdf, other]

TRANSAGENT: An LLM-Based Multi-Agent System for Code Translation

Authors: Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Kai Yu, Xin Peng, Yiling Lou

Abstract: Code translation converts code from one programming language to another while maintaining its original functionality, which is crucial for software migration, system refactoring, and cross-platform development. Traditional rule-based methods rely on manually-written rules, which can be time-consuming and often result in less readable code. To overcome this, learning-based methods have been developed, leveraging parallel data to train models for automated code translation. More recently, the advance of Large Language Models (LLMs) further boosts learning-based code translation. Although promising, LLM-translated programs still suffer from diverse quality issues (e.g., syntax errors and semantic errors). In particular, it can be challenging for LLMs to self-debug these errors when simply provided with the corresponding error messages. In this work, we propose a novel LLM-based multi-agent system TRANSAGENT, which enhances LLM-based code translation by fixing the syntax errors and semantic errors with the synergy between four LLM-based agents, including Initial Code Translator, Syntax Error Fixer, Code Aligner, and Semantic Error Fixer. The main insight of TRANSAGENT is to first localize the error code block in the target program based on the execution alignment between the target and source program, which can narrow down the fixing space and thus reduce the fixing difficulty. To evaluate TRANSAGENT, we first construct a new benchmark from recent programming tasks to mitigate the potential data leakage issue. On our benchmark, TRANSAGENT outperforms the latest LLM-based code translation technique UniTrans in both translation effectiveness and efficiency; additionally, our evaluation on different LLMs shows the generalization of TRANSAGENT and our ablation study shows the contribution of each agent.

Submitted 1 October, 2024; v1 submitted 29 September, 2024; originally announced September 2024.

arXiv:2409.19647 [pdf, other]

Fine-Tuning Hybrid Physics-Informed Neural Networks for Vehicle Dynamics Model Estimation

Authors: Shiming Fang, Kaiyan Yu

Abstract: Accurate dynamic modeling is critical for autonomous racing vehicles, especially during high-speed and agile maneuvers where precise motion prediction is essential for safety. Traditional parameter estimation methods face limitations such as reliance on initial guesses, labor-intensive fitting procedures, and complex testing setups. On the other hand, purely data-driven machine learning methods struggle to capture inherent physical constraints and typically require large datasets for optimal performance. To address these challenges, this paper introduces the Fine-Tuning Hybrid Dynamics (FTHD) method, which integrates supervised and unsupervised Physics-Informed Neural Networks (PINNs), combining physics-based modeling with data-driven techniques. FTHD fine-tunes a pre-trained Deep Dynamics Model (DDM) using a smaller training dataset, delivering superior performance compared to state-of-the-art methods such as the Deep Pacejka Model (DPM) and outperforming the original DDM. Furthermore, an Extended Kalman Filter (EKF) is embedded within FTHD (EKF-FTHD) to effectively manage noisy real-world data, ensuring accurate denoising while preserving the vehicle's essential physical characteristics. The proposed FTHD framework is validated through scaled simulations using the BayesRace Physics-based Simulator and full-scale real-world experiments from the Indy Autonomous Challenge. Results demonstrate that the hybrid approach significantly improves parameter estimation accuracy, even with reduced data, and outperforms existing models. EKF-FTHD enhances robustness by denoising real-world data while maintaining physical insights, representing a notable advancement in vehicle dynamics modeling for high-speed autonomous racing.

Submitted 29 September, 2024; originally announced September 2024.

arXiv:2409.18968 [pdf, other]

Safety challenges of AI in medicine in the era of large language models

Authors: Xiaoye Wang, Nicole Xi Zhang, Hongyu He, Trang Nguyen, Kun-Hsing Yu, Hao Deng, Cynthia Brandt, Danielle S. Bitterman, Ling Pan, Ching-Yu Cheng, James Zou, Dianbo Liu

Abstract: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have unlocked significant potential to enhance the quality and efficiency of medical care. By introducing a novel way to interact with AI and data through natural language, LLMs offer new opportunities for medical practitioners, patients, and researchers. However, as AI and LLMs become more powerful and especially achieve superhuman performance in some medical tasks, public concerns over their safety have intensified. These concerns about AI safety have emerged as the most significant obstacles to the adoption of AI in medicine. In response, this review examines emerging risks in AI utilization during the LLM era. First, we explore LLM-specific safety challenges from functional and communication perspectives, addressing issues across data collection, model training, and real-world application. We then consider inherent safety problems shared by all AI systems, along with additional complications introduced by LLMs. Last, we discuss how safety issues of using AI in clinical practice and healthcare system operation would undermine trust among patients, clinicians and the public, and how to build confidence in these systems. By emphasizing the development of safe AI, we believe these technologies can be more rapidly and reliably integrated into everyday medical practice to benefit both patients and clinicians.

Submitted 30 January, 2025; v1 submitted 11 September, 2024; originally announced September 2024.

arXiv:2409.18412 [pdf, other]

SciDFM: A Large Language Model with Mixture-of-Experts for Science

Authors: Liangtai Sun, Danyu Luo, Da Ma, Zihan Zhao, Baocai Chen, Zhennan Shen, Su Zhu, Lu Chen, Xin Chen, Kai Yu

Abstract: Recently, there has been a significant upsurge of interest in leveraging large language models (LLMs) to assist scientific discovery. However, most LLMs only focus on general science, while they lack domain-specific knowledge, such as chemical molecules and amino acid sequences. To bridge these gaps, we introduce SciDFM, a mixture-of-experts LLM, which is trained from scratch and is able to conduct college-level scientific reasoning and understand molecules and amino acid sequences. We collect a large-scale training corpus containing numerous scientific papers and books from different disciplines as well as data from domain-specific databases. We further fine-tune the pre-trained model on a large amount of instruction data to improve performance on downstream benchmarks. From experiment results, we show that SciDFM achieves strong performance on general scientific benchmarks such as SciEval and SciQ, and it reaches a SOTA performance on domain-specific benchmarks among models of similar size. We further analyze the expert layers and show that the results of expert selection vary with data from different disciplines. To benefit the broader research community, we open-source SciDFM at https://huggingface.co/OpenDFM/SciDFM-MoE-A5.6B-v1.0.

Submitted 12 November, 2024; v1 submitted 26 September, 2024; originally announced September 2024.

Comments: 12 pages, 1 figure, 9 tables. Technical Report, accepted by NeurIPS 2024 Workshop FM4Science

arXiv:2409.17741 [pdf, other]

On the origin of a broad QFP wave train: unwinding jet as the driver

Authors: Xinping Zhou, Zehao Tang, Zhining Qu, Ke Yu, Chengrui Zhou, Yuqi Xiang, Ahmed Ahmed Ibrahim, Yuandeng Shen

Abstract: Large-scale extreme-ultraviolet (EUV) waves commonly appear as a single wavefront and are believed to be caused by coronal mass ejections (CMEs). Utilizing high spatiotemporal resolution imaging observations from the Solar Dynamics Observatory, we present two sequentially generated wave trains originating from the same active region: a narrow quasiperiodic fast-propagating (QFP) wave train that propagates along the coronal loop system above the jet and a broad QFP wave train that travels along the solar surface beneath the jet. The measurements indicate that the narrow QFP wave train and the accompanying flare's quasiperiodic pulsations (QPPs) have nearly identical onsets and periods. This result suggests that the accompanying flare process excites the observed narrow QFP wave train. However, the broad QFP wave train starts approximately 2 minutes before the QPPs of the flare, but is consistent with the interaction between the unwinding jet and the solar surface. Moreover, we find that the period of the broad QFP wave train, approximately 130 s, closely matches that of the unwinding jet. This period is significantly longer than the 30 s period of the accompanying flare's QPPs. Based on these findings, we propose that the intermittent energy release of the accompanying flare excited the narrow QFP wave train, which propagates confined within the coronal loop system. The unwinding jet, rather than the intermittent energy release in the accompanying flare, triggered the broad QFP wave train propagating along the solar surface.

Submitted 26 September, 2024; originally announced September 2024.

arXiv:2409.14660 [pdf, other]

Fourier neural operators for spatiotemporal dynamics in two-dimensional turbulence

Authors: Mohammad Atif, Pulkit Dubey, Pratik P. Aghor, Vanessa Lopez-Marrero, Tao Zhang, Abdullah Sharfuddin, Kwangmin Yu, Fan Yang, Foluso Ladeinde, Yangang Liu, Meifeng Lin, Lingda Li

Abstract: High-fidelity direct numerical simulation of turbulent flows for most real-world applications remains an outstanding computational challenge. Several machine learning approaches have recently been proposed to alleviate the computational cost even though they become unstable or unphysical for long time predictions. We identify that the Fourier neural operator (FNO) based models combined with a partial differential equation (PDE) solver can accelerate fluid dynamic simulations and thus address computational expense of large-scale turbulence simulations. We treat the FNO model on the same footing as a PDE solver and answer important questions about the volume and temporal resolution of data required to build pre-trained models for turbulence. We also discuss the pitfalls of purely data-driven approaches that need to be avoided by the machine learning models to become viable and competitive tools for long time simulations of turbulence.

Submitted 25 September, 2024; v1 submitted 22 September, 2024; originally announced September 2024.

arXiv:2409.13194 [pdf, other]

doi 10.1007/s11432-024-4243-0

ChemDFM-X: Towards Large Multimodal Model for Chemistry

Authors: Zihan Zhao, Bo Chen, Jingpiao Li, Lu Chen, Liyang Wen, Pengyu Wang, Zichen Zhu, Danyang Zhang, Ziping Wan, Yansi Li, Zhongyang Dai, Xin Chen, Kai Yu

Abstract: Rapid developments of AI tools are expected to offer unprecedented assistance to the research of natural science including chemistry. However, neither existing unimodal task-specific specialist models nor emerging general large multimodal models (LMM) can cover the wide range of chemical data modality and task categories. To address the real demands of chemists, a cross-modal Chemical General Intelligence (CGI) system, which serves as a truly practical and useful research assistant utilizing the great potential of LMMs, is in great need. In this work, we introduce the first Cross-modal Dialogue Foundation Model for Chemistry (ChemDFM-X). Diverse multimodal data are generated from an initial modality by approximate calculations and task-specific model predictions. This strategy creates sufficient chemical training corpora, while significantly reducing excessive expense, resulting in an instruction-tuning dataset containing 7.6M data. After instruction finetuning, ChemDFM-X is evaluated on extensive experiments of different chemical tasks with various data modalities. The results demonstrate the capacity of ChemDFM-X for multimodal and inter-modal knowledge comprehension. ChemDFM-X marks a significant milestone toward aligning all modalities in chemistry, a step closer to CGI.

Submitted 2 January, 2025; v1 submitted 19 September, 2024; originally announced September 2024.

Comments: 19 pages, 7 figures, 11 tables

arXiv:2409.10865 [pdf, other]

A Three-Coupled-Channel Analysis of $Z_c(3900)$ Involving $D\bar{D}^*$, $πJ/ψ$, and $ρη_c $

Authors: Kang Yu, Guang-Juan Wang, Jia-Jun Wu, Zhi Yang

Abstract: In this work, we conduct a three-coupled-channel analysis of the $Z_c(3900)$ structure, focusing on the $D\bar{D}^*$, $J/ψπ$, and $ρη_c$ channels, based on the one-boson exchange model. Drawing from a previous study on the exotic state $T_{cc}$, we only utilize one more parameter to construct the interactions between the channels. Our model successfully reproduces the experimental line shapes of the invariant mass distribution at $\sqrt{s} = 4.23$ and $4.26$ GeV for the three channels. Additionally, the finite-volume energy levels in our model show agreement with current LQCD conclusions. Detailed analysis suggests that the $Z_c(3900)$ peaks in the $πJ/ψ$ and $ρη_c$ distributions primarily arise from the triangle loop involving the $D_1 \bar{D} D^*$ intermediate system. In the $D\bar{D}^*$ distribution, the threshold peak is generated by the cascade decay mechanism enhanced by a triangle diagram. Moreover, we find a virtual pole located far from the threshold, indicating that the $Z_c(3900)$ peaks are not associated with a physical pole. We conclude that the $Z_c(3900)$ peaks are predominantly caused by the threshold cusp.

Submitted 16 September, 2024; originally announced September 2024.

arXiv:2409.10626 [pdf, other]

Observation of Interface Piezoelectricity in Superconducting Devices on Silicon

Authors: Haoxin Zhou, Eric Li, Kadircan Godeneli, Zi-Huai Zhang, Shahin Jahanbani, Kangdi Yu, Mutasem Odeh, Shaul Aloni, Sinéad Griffin, Alp Sipahigil

Abstract: The evolution of superconducting quantum processors is driven by the need to reduce errors and scale for fault-tolerant computation. Reducing physical qubit error rates requires further advances in the microscopic modeling and control of decoherence mechanisms in superconducting qubits. Piezoelectric interactions contribute to decoherence by mediating energy exchange between microwave photons and acoustic phonons. Centrosymmetric materials like silicon and sapphire do not display piezoelectricity and are the preferred substrates for superconducting qubits. However, the broken centrosymmetry at material interfaces may lead to piezoelectric losses in qubits. While this loss mechanism was predicted two decades ago, interface piezoelectricity has not been experimentally observed in superconducting devices. Here, we report the observation of interface piezoelectricity at an aluminum-silicon junction and show that it constitutes an important loss channel for superconducting devices. We fabricate aluminum interdigital surface acoustic wave transducers on silicon and demonstrate piezoelectric transduction from room temperature to millikelvin temperatures. We find an effective electromechanical coupling factor of $K^2\approx 2 \times 10^{-5}\%$ comparable to weakly piezoelectric substrates. We model the impact of the measured interface piezoelectric response on superconducting qubits and find that the piezoelectric surface loss channel limits qubit quality factors to $Q\sim10^4-10^8$ for designs with different surface participation ratios and electromechanical mode matching. These results identify electromechanical surface losses as a significant dissipation channel for superconducting qubits, and show the need for heterostructure and phononic engineering to minimize errors in next-generation superconducting qubits.

Submitted 16 September, 2024; originally announced September 2024.

arXiv:2409.05525 [pdf, other]

Weighted Squared Volume Minimization (WSVM) for Generating Uniform Tetrahedral Meshes

Authors: Kaixin Yu, Yifu Wang, Peng Song, Xiangqiao Meng, Ying He, Jianjun Chen

Abstract: This paper presents a new algorithm, Weighted Squared Volume Minimization (WSVM), for generating high-quality tetrahedral meshes from closed triangle meshes. Drawing inspiration from the principle of minimal surfaces that minimize squared surface area, WSVM employs a new energy function integrating weighted squared volumes for tetrahedral elements. When minimized with constant weights, this energy promotes uniform volumes among the tetrahedra. Adjusting the weights to account for local geometry further achieves uniform dihedral angles within the mesh. The algorithm begins with an initial tetrahedral mesh generated via Delaunay tetrahedralization and proceeds by sequentially minimizing volume-oriented and then dihedral angle-oriented energies. At each stage, it alternates between optimizing vertex positions and refining mesh connectivity through the iterative process. The algorithm operates fully automatically and requires no parameter tuning. Evaluations on a variety of 3D models demonstrate that WSVM consistently produces tetrahedral meshes of higher quality, with fewer slivers and enhanced uniformity compared to existing methods. Check out further details at the project webpage: https://kaixinyu-hub.github.io/WSVM.github.io.

Submitted 9 September, 2024; originally announced September 2024.

arXiv:2409.04900 [pdf, other]

XR Prototyping of Mixed Reality Visualizations: Compensating Interaction Latency for a Medical Imaging Robot

Authors: Jan Hendrik Plümer, Kevin Yu, Ulrich Eck, Denis Kalkofen, Philipp Steininger, Nassir Navab, Markus Tatzgern

Abstract: Researching novel user experiences in medicine is challenging due to limited access to equipment and strict ethical protocols. Extended Reality (XR) simulation technologies offer a cost- and time-efficient solution for developing interactive systems. Recent work has shown Extended Reality Prototyping (XRP)'s potential, but its applicability to specific domains like controlling complex machinery needs further exploration. This paper explores the benefits and limitations of XRP in controlling a mobile medical imaging robot. We compare two XR visualization techniques to reduce perceived latency between user input and robot activation. Our XRP validation study demonstrates its potential for comparative studies, but identifies a gap in modeling human behavior in the analytic XRP validation framework.

Submitted 16 September, 2024; v1 submitted 7 September, 2024; originally announced September 2024.

arXiv:2409.01995 [pdf, other]

vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

Authors: Yiwei Guo, Zhihan Li, Junjie Li, Chenpeng Du, Hankun Wang, Shuai Wang, Xie Chen, Kai Yu

Abstract: We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To amend the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes the WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Also, no supervised data is required for vec2wav 2.0 to be effectively trained. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines to a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effects made by the proposed techniques. Moreover, vec2wav 2.0 achieves competitive cross-lingual VC even only trained on monolingual corpus. Thus, vec2wav 2.0 shows timbre can potentially be manipulated only by speech token vocoders, pushing the frontiers of VC and speech synthesis.

Submitted 24 May, 2025; v1 submitted 3 September, 2024; originally announced September 2024.

Comments: 5 pages, 3 figures, 2 tables. Demo page: https://cantabile-kwok.github.io/vec2wav2/

Showing 101–150 of 728 results for author: Yu, K