
Showing 101–150 of 728 results for author: Yu, K

  1. arXiv:2412.09892 [pdf, other]

    cs.CV

    VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization

    Authors: Tao Liu, Ziyang Ma, Qi Chen, Feilong Chen, Shuai Fan, Xie Chen, Kai Yu

    Abstract: We present VQTalker, a Vector Quantization-based framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units (phonemes) and corresponding visual articulations (visemes), which often share commona…

    Submitted 18 December, 2024; v1 submitted 13 December, 2024; originally announced December 2024.

    Comments: 14 pages

  2. arXiv:2412.09028 [pdf, other]

    cs.LG eess.SY

    Learning and Current Prediction of PMSM Drive via Differential Neural Networks

    Authors: Wenjie Mei, Xiaorui Wang, Yanrong Lu, Ke Yu, Shihua Li

    Abstract: Learning models for dynamical systems in continuous time is significant for understanding complex phenomena and making accurate predictions. This study presents a novel approach utilizing differential neural networks (DNNs) to model nonlinear systems, specifically permanent magnet synchronous motors (PMSMs), and to predict their current trajectories. The efficacy of our approach is validated throu…

    Submitted 12 December, 2024; originally announced December 2024.

  3. arXiv:2412.08443 [pdf, other]

    cs.CV cs.MM

    POINTS1.5: Building a Vision-Language Model towards Real World Applications

    Authors: Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou

    Abstract: Vision-language models have made significant strides recently, demonstrating superior performance across a range of tasks, e.g. optical character recognition and complex diagram analysis. Building on this trend, we introduce a new vision-language model, POINTS1.5, designed to excel in various real-world applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several key innovations…

    Submitted 11 December, 2024; originally announced December 2024.

  4. arXiv:2412.08218 [pdf, other]

    cs.DB

    Maximal Clique Enumeration with Hybrid Branching and Early Termination

    Authors: Kaixin Wang, Kaiqiang Yu, Cheng Long

    Abstract: Maximal clique enumeration (MCE) is crucial for tasks like community detection and biological network analysis. Existing algorithms typically adopt the branch-and-bound framework with the vertex-oriented Bron-Kerbosch (BK) branching strategy, which forms the sub-branches by expanding the partial clique with a vertex. In this paper, we present a novel approach called HBBMC, a hybrid framework combi…

    Submitted 11 December, 2024; originally announced December 2024.

    Comments: Accepted by ICDE'25

  5. arXiv:2412.06217 [pdf, other]

    physics.optics physics.app-ph

    Large Bidirectional Refractive Index Change in Silicon-rich Nitride via Visible Light Trimming

    Authors: Dmitrii Belogolovskii, Md Masudur Rahman, Karl Johnson, Vladimir Fedorov, Andrew Grieco, Nikola Alic, Abdoulaye Ndao, Paul K. L. Yu, Yeshaiahu Fainman

    Abstract: Phase-sensitive integrated photonic devices are highly susceptible to minor manufacturing deviations, resulting in significant performance inconsistencies. This variability has limited the scalability and widespread adoption of these devices. Here, a major advancement is achieved through continuous-wave (CW) visible light (405 nm and 520 nm) trimming of plasma-enhanced chemical vapor deposition (P…

    Submitted 15 February, 2025; v1 submitted 9 December, 2024; originally announced December 2024.

    Comments: 23 pages, 11 figures. Replacement reason: Minor changes only to fix typos and improve clarity

  6. arXiv:2412.05004 [pdf, other]

    cs.LG cs.CY

    Prompt Transfer for Dual-Aspect Cross Domain Cognitive Diagnosis

    Authors: Fei Liu, Yizhong Zhang, Shuochen Liu, Shengwei Ji, Kui Yu, Le Wu

    Abstract: Cognitive Diagnosis (CD) aims to evaluate students' cognitive states based on their interaction data, enabling downstream applications such as exercise recommendation and personalized learning guidance. However, existing methods often struggle with accuracy drops in cross-domain cognitive diagnosis (CDCD), a practical yet challenging task. While some efforts have explored exercise-aspect CDCD, suc…

    Submitted 6 December, 2024; originally announced December 2024.

  7. arXiv:2412.04729 [pdf, other]

    cs.CV

    Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model

    Authors: Keunwoo Peter Yu, Achal Dave, Rares Ambrus, Jean Mercat

    Abstract: Recent advances in vision-language models (VLMs) have shown great promise in connecting images and text, but extending these models to long videos remains challenging due to the rapid growth in token counts. Models that compress videos by local aggregation in time or space have become popular for handling long-form inputs; however, these pooling-based projectors sacrifice the benefits of fixed-len…

    Submitted 16 May, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

    Comments: 16 pages

  8. arXiv:2412.04141 [pdf, ps, other]

    cs.CL

    Reducing Tool Hallucination via Reliability Alignment

    Authors: Hongshen Xu, Zichen Zhu, Lei Pan, Zihan Wang, Su Zhu, Da Ma, Ruisheng Cao, Lu Chen, Kai Yu

    Abstract: Large Language Models (LLMs) have expanded their capabilities beyond language generation to interact with external tools, enabling automation and real-world applications. However, tool hallucinations, where models either select inappropriate tools or misuse them, pose significant challenges, leading to erroneous task execution, increased computational costs, and reduced system reliability. To syst…

    Submitted 29 May, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

  9. arXiv:2412.02252 [pdf, other]

    cs.CL

    Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

    Authors: Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu

    Abstract: The increasing context window size in Large Language Models (LLMs), such as the GPT and LLaMA series, has improved their ability to tackle complex, long-text tasks, but at the cost of inference efficiency, particularly regarding memory and computational complexity. Existing methods, including selective token retention and window-based attention, improve efficiency but risk discarding important tok…

    Submitted 3 December, 2024; originally announced December 2024.

    Comments: preprint

  10. arXiv:2412.00645 [pdf, ps, other]

    quant-ph

    Quantum Convolutional Neural Network with Flexible Stride

    Authors: Kai Yu, Song Lin, Bin-Bin Cai

    Abstract: Convolutional neural network is a crucial tool for machine learning, especially in the field of computer vision. Its unique structure and characteristics provide significant advantages in feature extraction. However, with the exponential growth of data scale, classical computing architectures face serious challenges in terms of time efficiency and memory requirements. In this paper, we propose a n…

    Submitted 30 November, 2024; originally announced December 2024.

  11. Discrepancy in Oil Displacement Mechanisms at the Equivalent Interfacial Tensions: Differentiating Contributions from Surfactant and Nanoparticles on Interfacial Activities

    Authors: Suparit Tangparitkul, Thakheru Akamine, David Harbottle, Falan Srisuriyachai, Kai Yu

    Abstract: This study examines discrepancies in oil displacement mechanisms at equivalent interfacial tensions, focusing on the distinct contributions of surfactants and nanoparticles. It was hypothesized that similar interfacial activities would result in consistent displacement outcomes, while differences would reflect unique interfacial behaviors. Micromodel experiments revealed that at high interfacial t…

    Submitted 22 November, 2024; originally announced November 2024.

    Comments: 19 pages

  12. arXiv:2411.14347 [pdf, other]

    cs.CV

    DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

    Authors: Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, Xingyu Chen, Zhuheng Song, Yuhong Zhang, Hongjie Huang, Han Gao, Shilong Liu, Hao Zhang, Feng Li, Kent Yu, Lei Zhang

    Abstract: In this paper, we introduce DINO-X, which is a unified object-centric vision model developed by IDEA Research with the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easy, DINO-X extend…

    Submitted 15 May, 2025; v1 submitted 21 November, 2024; originally announced November 2024.

    Comments: Technical Report

  13. arXiv:2411.13914 [pdf, other]

    cs.LG

    ICODE: Modeling Dynamical Systems with Extrinsic Input Information

    Authors: Zhaoyi Li, Wenjie Mei, Ke Yu, Yang Bai, Shihua Li

    Abstract: Learning models of dynamical systems with external inputs, which may be, for example, nonsmooth or piecewise, is crucial for studying complex phenomena and predicting future state evolution, which is essential for applications such as safety guarantees and decision-making. In this work, we introduce Input Concomitant Neural ODEs (ICODEs), which incorporate precise real-time input informatio…

    Submitted 15 April, 2025; v1 submitted 21 November, 2024; originally announced November 2024.

    Comments: To be published in IEEE Transactions on Automation Science and Engineering

  14. arXiv:2411.09371 [pdf, other]

    cs.CV

    DSCformer: A Dual-Branch Network Integrating Enhanced Dynamic Snake Convolution and SegFormer for Crack Segmentation

    Authors: Kaiwei Yu, I-Ming Chen, Jing Wu

    Abstract: In construction quality monitoring, accurately detecting and segmenting cracks in concrete structures is paramount for safety and maintenance. Current convolutional neural networks (CNNs) have demonstrated strong performance in crack segmentation tasks, yet they often struggle with complex backgrounds and fail to capture fine-grained tubular structures fully. In contrast, Transformers excel at cap…

    Submitted 14 November, 2024; originally announced November 2024.

  15. arXiv:2411.04142 [pdf, other]

    eess.AS cs.CL cs.SD

    Unified Pathological Speech Analysis with Prompt Tuning

    Authors: Fei Yang, Xuenan Xu, Mengyue Wu, Kai Yu

    Abstract: Pathological speech analysis has been of interest in the detection of certain diseases like depression and Alzheimer's disease and attracts much interest from researchers. However, previous pathological speech analysis models are commonly designed for a specific disease while overlooking the connection between diseases, which may constrain performance and lower training efficiency. Instead of fine…

    Submitted 5 November, 2024; originally announced November 2024.

    Comments: This work has been submitted to the IEEE for possible publication

  16. arXiv:2410.21951 [pdf, other]

    eess.AS cs.AI cs.SD

    Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding

    Authors: Bohan Li, Hankun Wang, Situo Zhang, Yiwei Guo, Kai Yu

    Abstract: The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show…

    Submitted 9 February, 2025; v1 submitted 29 October, 2024; originally announced October 2024.

    Comments: Accepted by ICASSP 2025

    MSC Class: 68T07

  17. arXiv:2410.21312 [pdf, other]

    cs.LG cs.AI cs.CL

    PatentAgent: Intelligent Agent for Automated Pharmaceutical Patent Analysis

    Authors: Xin Wang, Yifan Zhang, Xiaojing Zhang, Longhui Yu, Xinna Lin, Jindong Jiang, Bin Ma, Kaicheng Yu

    Abstract: Pharmaceutical patents play a vital role in biochemical industries, especially in drug discovery, providing researchers with unique early access to data, experimental results, and research insights. With the advancement of machine learning, patent analysis has evolved from manual labor to tasks assisted by automatic tools. However, there is still no unified agent that assists every aspect of pa…

    Submitted 25 October, 2024; originally announced October 2024.

    Comments: 7 pages

  18. arXiv:2410.18908 [pdf, other]

    eess.AS

    A Survey on Speech Large Language Models

    Authors: Jing Peng, Yucheng Wang, Yangui Fang, Yu Xi, Xu Li, Xizhuo Zhang, Kai Yu

    Abstract: Large Language Models (LLMs) exhibit strong contextual understanding and remarkable multitask performance. As a result, researchers have been actively exploring the integration of LLMs into the domain of speech understanding, with a primary focus on a broad range of speech-to-text tasks. These include automatic speech recognition (ASR), speech-to-text translation (ST), speech emotion recognition (…

    Submitted 26 May, 2025; v1 submitted 24 October, 2024; originally announced October 2024.

    Comments: This version has been updated to incorporate recent work in the field and includes revised illustrations and textual descriptions

  19. arXiv:2410.18558 [pdf, other]

    cs.CL

    Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

    Authors: Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, Zhenchong Hu, Bo-Wen Zhang, Jijie Li, Dong Liang, Yingli Zhao, Songjing Wang, Yulong Ao, Yiming Ju, Huanhuan Ma, Xiaotong Li, Haiwen Diao, Yufeng Cui, Xinlong Wang, Yaoqi Liu, Fangxiang Feng, et al. (1 additional author not shown)

    Abstract: Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data serves as the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, limitations in the scale and quality of open-source instruction data hinder the performance of VLMs trained on these datasets, leading to a signifi…

    Submitted 6 January, 2025; v1 submitted 24 October, 2024; originally announced October 2024.

  20. arXiv:2410.16805 [pdf, other]

    cs.LG cs.CR

    Test-time Adversarial Defense with Opposite Adversarial Path and High Attack Time Cost

    Authors: Cheng-Han Yeh, Kuanchun Yu, Chun-Shien Lu

    Abstract: Deep learning models are known to be vulnerable to adversarial attacks by injecting sophisticated designed perturbations to input data. Training-time defenses still exhibit a significant performance gap between natural accuracy and robust accuracy. In this paper, we investigate a new test-time adversarial defense method via diffusion-based recovery along opposite adversarial paths (OAPs). We prese…

    Submitted 19 May, 2025; v1 submitted 22 October, 2024; originally announced October 2024.

  21. arXiv:2410.16286 [pdf, other]

    cs.CV

    Solution for Point Tracking Task of ECCV 2nd Perception Test Challenge 2024

    Authors: Yuxuan Zhang, Pengsong Niu, Kun Yu, Qingguo Chen, Yang Yang

    Abstract: This report introduces an improved method for the Tracking Any Point (TAP), focusing on monitoring physical surfaces in video footage. Despite their success with short-sequence scenarios, TAP methods still face performance degradation and resource overhead in long-sequence situations. To address these issues, we propose a simple yet effective approach called Fine-grained Point Discrimination (\tex…

    Submitted 5 October, 2024; originally announced October 2024.

  22. arXiv:2410.15764 [pdf, other]

    eess.AS cs.AI cs.SD

    LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

    Authors: Yiwei Guo, Zhihan Li, Chenpeng Du, Hankun Wang, Xie Chen, Kai Yu

    Abstract: Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec that has both low bitrate and speaker decoupling ability. LSCodec adopts a multi-stage unsupervised training framework with a speaker pertur…

    Submitted 21 May, 2025; v1 submitted 21 October, 2024; originally announced October 2024.

    Comments: 5 pages, 2 figures, 3 tables. Demo page: https://cantabile-kwok.github.io/LSCodec/. Accepted to Interspeech 2025

  23. arXiv:2410.15648 [pdf, other]

    cs.LG stat.ME

    Linking Model Intervention to Causal Interpretation in Model Explanation

    Authors: Debo Cheng, Ziqi Xu, Jiuyong Li, Lin Liu, Kui Yu, Thuc Duy Le, Jixue Liu

    Abstract: Intervention intuition is often used in model explanation where the intervention effect of a feature on the outcome is quantified by the difference of a model prediction when the feature value is changed from the current value to the baseline value. Such a model intervention effect of a feature is inherently association. In this paper, we will study the conditions when an intuitive model intervent…

    Submitted 21 October, 2024; originally announced October 2024.

  24. arXiv:2410.15621 [pdf, other]

    cs.PF

    DRIM-ANN: An Approximate Nearest Neighbor Search Engine based on Commercial DRAM-PIMs

    Authors: Mingkai Chen, Tianhua Han, Cheng Liu, Shengwen Liang, Kuai Yu, Lei Dai, Ziming Yuan, Ying Wang, Lei Zhang, Huawei Li, Xiaowei Li

    Abstract: Approximate Nearest Neighbor Search (ANNS), which enables efficient semantic similarity search in large datasets, has become a fundamental component of critical applications such as information retrieval and retrieval-augmented generation (RAG). However, ANNS is a well-known I/O-intensive algorithm with a low compute-to-I/O ratio, often requiring massive storage due to the large volume of high-dim…

    Submitted 20 October, 2024; originally announced October 2024.

  25. arXiv:2410.14357 [pdf, other]

    quant-ph cs.DC hep-ph physics.chem-ph

    Efficient charge-preserving excited state preparation with variational quantum algorithms

    Authors: Zohim Chandani, Kazuki Ikeda, Zhong-Bo Kang, Dmitri E. Kharzeev, Alexander McCaskey, Andrea Palermo, C. R. Ramakrishnan, Pooja Rao, Ranjani G. Sundaram, Kwangmin Yu

    Abstract: Determining the spectrum and wave functions of excited states of a system is crucial in quantum physics and chemistry. Low-depth quantum algorithms, such as the Variational Quantum Eigensolver (VQE) and its variants, can be used to determine the ground-state energy. However, current approaches to computing excited states require numerous controlled unitaries, making the application of the original…

    Submitted 18 October, 2024; originally announced October 2024.

    Comments: 20 pages, 6 figures, 1 table

  26. arXiv:2410.13757 [pdf, other]

    cs.MA cs.AI cs.CL cs.HC

    MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation

    Authors: Zichen Zhu, Hao Tang, Yansi Li, Dingye Liu, Hongshen Xu, Kunyao Lan, Danyang Zhang, Yixuan Jiang, Hao Zhou, Chenrun Wang, Situo Zhang, Liangtai Sun, Yixiao Wang, Yuheng Sun, Lu Chen, Kai Yu

    Abstract: Existing Multimodal Large Language Model (MLLM)-based agents face significant challenges in handling complex GUI (Graphical User Interface) interactions on devices. These challenges arise from the dynamic and structured nature of GUI environments, which integrate text, images, and spatial relationships, as well as the variability in action spaces across different pages and tasks. To address these…

    Submitted 13 May, 2025; v1 submitted 17 October, 2024; originally announced October 2024.

    Comments: NAACL 2025 Demo Track [code] https://github.com/OpenDFM/MobA [dataset] https://huggingface.co/datasets/OpenDFM/MobA-MobBench

  27. arXiv:2410.12205 [pdf]

    cs.HC

    Challenges in Adopting Companion Robots: An Exploratory Study of Robotic Companionship Conducted with Chinese Retirees

    Authors: Mengyang Wang, Keye Yu, Yukai Zhang, Mingming Fan

    Abstract: Companion robots hold immense potential in providing emotional support to older adults in the rapidly aging world. However, questions have been raised regarding whether having a robotic companion benefits healthy older adults, how they perceive the value of companion robots, and what their relationship with companion robots would be like. To understand healthy older adults' perceptions, attitudes,…

    Submitted 15 October, 2024; originally announced October 2024.

  28. arXiv:2410.11718 [pdf, other]

    cs.CL

    Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models

    Authors: Hongchuan Zeng, Senyu Han, Lu Chen, Kai Yu

    Abstract: Large language models (LLMs) have demonstrated remarkable performance, particularly in multilingual contexts. While recent studies suggest that LLMs can transfer skills learned in one language to others, the internal mechanisms behind this ability remain unclear. We observed that the neuron activation patterns of LLMs exhibit similarities when processing the same language, revealing the existence…

    Submitted 28 February, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

    Comments: 16 pages, 11 figures, 4 tables

  29. arXiv:2410.10158 [pdf, other]

    cs.LG math.OC

    Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism

    Authors: Kihyun Yu, Duksang Lee, William Overman, Dabeen Lee

    Abstract: This paper studies the safe reinforcement learning problem formulated as an episodic finite-horizon tabular constrained Markov decision process with an unknown transition kernel and stochastic reward and cost functions. We propose a model-based algorithm based on novel cost and reward function estimators that provide tighter cost pessimism and reward optimism. While guaranteeing no constraint viol…

    Submitted 14 October, 2024; originally announced October 2024.

  30. arXiv:2410.09503 [pdf, other]

    eess.AS cs.SD

    SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

    Authors: Wenxi Chen, Ziyang Ma, Xiquan Li, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Kai Yu, Xie Chen

    Abstract: Automated Audio Captioning (AAC) aims to generate natural textual descriptions for input audio signals. Recent progress in audio pre-trained models and large language models (LLMs) has significantly enhanced audio understanding and textual reasoning capabilities, making improvements in AAC possible. In this paper, we propose SLAM-AAC to further enhance AAC with paraphrasing augmentation and CLAP-R…

    Submitted 12 October, 2024; originally announced October 2024.

  31. arXiv:2410.08565 [pdf, other]

    cs.AI cs.CL cs.CV

    Baichuan-Omni Technical Report

    Authors: Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Tao Zhang, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu, et al. (2 additional authors not shown)

    Abstract: The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering…

    Submitted 27 December, 2024; v1 submitted 11 October, 2024; originally announced October 2024.

  32. arXiv:2410.07675 [pdf, other]

    cs.LG cs.AI

    Adversarial Robustness Overestimation and Instability in TRADES

    Authors: Jonathan Weiping Li, Ren-Wei Liang, Cheng-Han Yeh, Cheng-Chang Tsai, Kuanchun Yu, Chun-Shien Lu, Shang-Tse Chen

    Abstract: This paper examines the phenomenon of probabilistic robustness overestimation in TRADES, a prominent adversarial training method. Our study reveals that TRADES sometimes yields disproportionately high PGD validation accuracy compared to the AutoAttack testing accuracy in the multiclass classification task. This discrepancy highlights a significant overestimation of robustness for these instances,…

    Submitted 10 October, 2024; originally announced October 2024.

  33. arXiv:2410.06885 [pdf, other]

    eess.AS cs.SD

    F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

    Authors: Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, Xie Chen

    Abstract: This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally pr…

    Submitted 20 May, 2025; v1 submitted 9 October, 2024; originally announced October 2024.

    Comments: 17 pages, 9 tables, 3 figures

  34. arXiv:2410.06519 [pdf, other]

    cs.CL

    SEGMENT+: Long Text Processing with Short-Context Language Models

    Authors: Wei Shi, Shuang Li, Kerun Yu, Jinglei Chen, Zujie Liang, Xinhui Wu, Yuxi Qian, Feng Wei, Bo Zheng, Jiaqing Liang, Jiangjie Chen, Yanghua Xiao

    Abstract: There is a growing interest in expanding the input capacity of language models (LMs) across various domains. However, simply increasing the context window does not guarantee robust performance across diverse long-input processing tasks, such as understanding extensive documents and extracting detailed information from lengthy and noisy data. In response, we introduce SEGMENT+, a general framework…

    Submitted 8 October, 2024; originally announced October 2024.

    Comments: EMNLP 2024

  35. arXiv:2410.04652 [pdf, other]

    cs.HC cs.AI cs.CV

    Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI

    Authors: Chengyuan Xu, Radha Kumaran, Noah Stier, Kangyou Yu, Tobias Höllerer

    Abstract: Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment. AR research has long focused on the potential of context awareness, demonstrating novel capabilities that leverage the semantics in the 3D environment for various object-level interactions. Meanwhile, the computer vision community has made leaps i…

    Submitted 6 October, 2024; originally announced October 2024.

    Comments: 10 pages, 6 figures, accepted to IEEE ISMAR 2024

    ACM Class: I.4.8; H.5.2

  36. arXiv:2410.03733 [pdf, other]

    cs.HC cs.AI

    Evaluating the Effects of AI Directors for Quest Selection

    Authors: Kristen K. Yu, Matthew Guzdial, Nathan Sturtevant

    Abstract: Modern commercial games are designed for mass appeal, not for individual players, but there is a unique opportunity in video games to better fit the individual through adapting game elements. In this paper, we focus on AI Directors, systems which can dynamically modify a game, that personalize the player experience to match the player's preference. In the past, some AI Director studies have provid…

    Submitted 30 September, 2024; originally announced October 2024.

  37. arXiv:2410.01585 [pdf]

    cs.HC

    Avatar Appearance and Behavior of Potential Harassers Affect Users' Perceptions and Response Strategies in Social Virtual Reality (VR): A Mixed-Methods Study

    Authors: Xuetong Wang, Ziyan Wang, Mingmin Zhang, Kangyou Yu, Pan Hui, Mingming Fan

    Abstract: Sexual harassment has been recognized as a significant social issue. In recent years, the emergence of harassment in social virtual reality (VR) has become an important and urgent research topic. We employed a mixed-methods approach by conducting online surveys with VR users (N = 166) and semi-structured interviews with social VR users (N = 18) to investigate how users perceive sexual harassment i…

    Submitted 14 October, 2024; v1 submitted 2 October, 2024; originally announced October 2024.

  38. arXiv:2410.00409 [pdf, other]

    cs.CL

    AlignSum: Data Pyramid Hierarchical Fine-tuning for Aligning with Human Summarization Preference

    Authors: Yang Han, Yiming Wang, Rui Wang, Lu Chen, Kai Yu

    Abstract: Text summarization tasks commonly employ Pre-trained Language Models (PLMs) to fit diverse standard datasets. While these PLMs excel in automatic evaluations, they frequently underperform in human evaluations, indicating a deviation between their generated summaries and human summarization preferences. This discrepancy is likely due to the low quality of fine-tuning datasets and the limited availa…

    Submitted 1 October, 2024; originally announced October 2024.

    Comments: EMNLP2024 Findings, code at: https://github.com/csyanghan/AlignSum

  39. arXiv:2409.19894 [pdf, other]

    cs.SE cs.AI

    TRANSAGENT: An LLM-Based Multi-Agent System for Code Translation

    Authors: Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Kai Yu, Xin Peng, Yiling Lou

    Abstract: Code translation converts code from one programming language to another while maintaining its original functionality, which is crucial for software migration, system refactoring, and cross-platform development. Traditional rule-based methods rely on manually-written rules, which can be time-consuming and often result in less readable code. To overcome this, learning-based methods have been develop…

    Submitted 1 October, 2024; v1 submitted 29 September, 2024; originally announced September 2024.

  40. arXiv:2409.19647 [pdf, other]

    cs.RO cs.AI eess.SY

    Fine-Tuning Hybrid Physics-Informed Neural Networks for Vehicle Dynamics Model Estimation

    Authors: Shiming Fang, Kaiyan Yu

    Abstract: Accurate dynamic modeling is critical for autonomous racing vehicles, especially during high-speed and agile maneuvers where precise motion prediction is essential for safety. Traditional parameter estimation methods face limitations such as reliance on initial guesses, labor-intensive fitting procedures, and complex testing setups. On the other hand, purely data-driven machine learning methods st…

    Submitted 29 September, 2024; originally announced September 2024.

  41. arXiv:2409.18968 [pdf, other]

    cs.CY cs.AI cs.LG

    Safety challenges of AI in medicine in the era of large language models

    Authors: Xiaoye Wang, Nicole Xi Zhang, Hongyu He, Trang Nguyen, Kun-Hsing Yu, Hao Deng, Cynthia Brandt, Danielle S. Bitterman, Ling Pan, Ching-Yu Cheng, James Zou, Dianbo Liu

    Abstract: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have unlocked significant potential to enhance the quality and efficiency of medical care. By introducing a novel way to interact with AI and data through natural language, LLMs offer new opportunities for medical practitioners, patients, and researchers. However, as AI and LLMs become more powerful…

    Submitted 30 January, 2025; v1 submitted 11 September, 2024; originally announced September 2024.

  42. arXiv:2409.18412  [pdf, other]

    cs.CL cs.AI

    SciDFM: A Large Language Model with Mixture-of-Experts for Science

    Authors: Liangtai Sun, Danyu Luo, Da Ma, Zihan Zhao, Baocai Chen, Zhennan Shen, Su Zhu, Lu Chen, Xin Chen, Kai Yu

    Abstract: Recently, there has been a significant upsurge of interest in leveraging large language models (LLMs) to assist scientific discovery. However, most LLMs only focus on general science, while they lack domain-specific knowledge, such as chemical molecules and amino acid sequences. To bridge these gaps, we introduce SciDFM, a mixture-of-experts LLM, which is trained from scratch and is able to conduc…

    Submitted 12 November, 2024; v1 submitted 26 September, 2024; originally announced September 2024.

    Comments: 12 pages, 1 figure, 9 tables. Technical Report, accepted by NeurIPS 2024 Workshop FM4Science

  43. arXiv:2409.17741  [pdf, other]

    astro-ph.SR

    On the origin of a broad QFP wave train: unwinding jet as the driver

    Authors: Xinping Zhou, Zehao Tang, Zhining Qu, Ke Yu, Chengrui Zhou, Yuqi Xiang, Ahmed Ahmed Ibrahim, Yuandeng Shen

    Abstract: Large-scale extreme-ultraviolet (EUV) waves commonly exhibit as single wavefront and are believed to be caused by coronal mass ejections (CMEs). Utilizing high spatiotemporal resolution imaging observations from the Solar Dynamics Observatory, we present two sequentially generated wave trains originating from the same active region: a narrow quasiperiodic fast-propagating (QFP) wave train that pro…

    Submitted 26 September, 2024; originally announced September 2024.

  44. arXiv:2409.14660  [pdf, other]

    physics.flu-dyn cs.LG nlin.CD

    Fourier neural operators for spatiotemporal dynamics in two-dimensional turbulence

    Authors: Mohammad Atif, Pulkit Dubey, Pratik P. Aghor, Vanessa Lopez-Marrero, Tao Zhang, Abdullah Sharfuddin, Kwangmin Yu, Fan Yang, Foluso Ladeinde, Yangang Liu, Meifeng Lin, Lingda Li

    Abstract: High-fidelity direct numerical simulation of turbulent flows for most real-world applications remains an outstanding computational challenge. Several machine learning approaches have recently been proposed to alleviate the computational cost even though they become unstable or unphysical for long time predictions. We identify that the Fourier neural operator (FNO) based models combined with a part…

    Submitted 25 September, 2024; v1 submitted 22 September, 2024; originally announced September 2024.

  45. ChemDFM-X: Towards Large Multimodal Model for Chemistry

    Authors: Zihan Zhao, Bo Chen, Jingpiao Li, Lu Chen, Liyang Wen, Pengyu Wang, Zichen Zhu, Danyang Zhang, Ziping Wan, Yansi Li, Zhongyang Dai, Xin Chen, Kai Yu

    Abstract: Rapid developments of AI tools are expected to offer unprecedented assistance to the research of natural science including chemistry. However, neither existing unimodal task-specific specialist models nor emerging general large multimodal models (LMM) can cover the wide range of chemical data modality and task categories. To address the real demands of chemists, a cross-modal Chemical General Inte…

    Submitted 2 January, 2025; v1 submitted 19 September, 2024; originally announced September 2024.

    Comments: 19 pages, 7 figures, 11 tables

  46. arXiv:2409.10865  [pdf, other]

    hep-ph

    A Three-Coupled-Channel Analysis of $Z_c(3900)$ Involving $D\bar{D}^*$, $πJ/ψ$, and $ρη_c $

    Authors: Kang Yu, Guang-Juan Wang, Jia-Jun Wu, Zhi Yang

    Abstract: In this work, we conduct a three-coupled-channel analysis of the $Z_c(3900)$ structure, focusing on the $D\bar{D}^*$, $J/ψπ$, and $ρη_c$ channels, based on the one-boson exchange model. Drawing from previous study on the exotic state $T_{cc}$, we only utilize one more parameter to construct the interactions between the channels. Our model successfully reproduces the experimental line shapes of the…

    Submitted 16 September, 2024; originally announced September 2024.

  47. arXiv:2409.10626  [pdf, other]

    quant-ph cond-mat.mes-hall physics.app-ph

    Observation of Interface Piezoelectricity in Superconducting Devices on Silicon

    Authors: Haoxin Zhou, Eric Li, Kadircan Godeneli, Zi-Huai Zhang, Shahin Jahanbani, Kangdi Yu, Mutasem Odeh, Shaul Aloni, Sinéad Griffin, Alp Sipahigil

    Abstract: The evolution of superconducting quantum processors is driven by the need to reduce errors and scale for fault-tolerant computation. Reducing physical qubit error rates requires further advances in the microscopic modeling and control of decoherence mechanisms in superconducting qubits. Piezoelectric interactions contribute to decoherence by mediating energy exchange between microwave photons and…

    Submitted 16 September, 2024; originally announced September 2024.

  48. arXiv:2409.05525  [pdf, other]

    cs.GR

    Weighted Squared Volume Minimization (WSVM) for Generating Uniform Tetrahedral Meshes

    Authors: Kaixin Yu, Yifu Wang, Peng Song, Xiangqiao Meng, Ying He, Jianjun Chen

    Abstract: This paper presents a new algorithm, Weighted Squared Volume Minimization (WSVM), for generating high-quality tetrahedral meshes from closed triangle meshes. Drawing inspiration from the principle of minimal surfaces that minimize squared surface area, WSVM employs a new energy function integrating weighted squared volumes for tetrahedral elements. When minimized with constant weights, this energy…

    Submitted 9 September, 2024; originally announced September 2024.

  49. arXiv:2409.04900  [pdf, other]

    cs.HC

    XR Prototyping of Mixed Reality Visualizations: Compensating Interaction Latency for a Medical Imaging Robot

    Authors: Jan Hendrik Plümer, Kevin Yu, Ulrich Eck, Denis Kalkofen, Philipp Steininger, Nassir Navab, Markus Tatzgern

    Abstract: Researching novel user experiences in medicine is challenging due to limited access to equipment and strict ethical protocols. Extended Reality (XR) simulation technologies offer a cost- and time-efficient solution for developing interactive systems. Recent work has shown Extended Reality Prototyping (XRP)'s potential, but its applicability to specific domains like controlling complex machinery ne…

    Submitted 16 September, 2024; v1 submitted 7 September, 2024; originally announced September 2024.

  50. arXiv:2409.01995  [pdf, other]

    eess.AS cs.AI cs.SD

    vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

    Authors: Yiwei Guo, Zhihan Li, Junjie Li, Chenpeng Du, Hankun Wang, Shuai Wang, Xie Chen, Kai Yu

    Abstract: We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To amend the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes the WavLM features to provide strong timbre-dependent information. A novel adap…

    Submitted 24 May, 2025; v1 submitted 3 September, 2024; originally announced September 2024.

    Comments: 5 pages, 3 figures, 2 tables. Demo page: https://cantabile-kwok.github.io/vec2wav2/