Showing 1–50 of 111 results for author: Yi, J

  1. arXiv:2505.11079  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    $\mathcal{A}LLM4ADD$: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection

    Authors: Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, Zhengqi Wen

    Abstract: Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. Given that audio large language models (ALLMs) have made significant progress in various audio processing tasks, a heuristic question arises: Can ALLMs be leveraged to solve ADD?. In this paper, we first conduct a comprehensive zero-shot evaluatio…

    Submitted 16 May, 2025; originally announced May 2025.

  2. arXiv:2503.15487  [pdf, other]

    eess.IV eess.SP physics.optics q-bio.NC stat.AP

    Fast Two-photon Microscopy by Neuroimaging with Oblong Random Acquisition (NORA)

    Authors: Esther Whang, Skyler Thomas, Ji Yi, Adam S. Charles

    Abstract: Advances in neural imaging have enabled neuroscience to study how the joint activity of large neural populations conspire to produce perception, behavior and cognition. Despite many advances in optical methods, there exists a fundamental tradeoff between imaging speed, field of view, and resolution that limits the scope of neural imaging, especially for the raster-scanning multi-photon imaging nee…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: 12 pages, 4 figures

  3. arXiv:2412.11551  [pdf, other]

    cs.SD cs.AI eess.AS

    Region-Based Optimization in Continual Learning for Audio Deepfake Detection

    Authors: Yujie Chen, Jiangyan Yi, Cunhang Fan, Jianhua Tao, Yong Ren, Siding Zeng, Chu Yuan Zhang, Xinrui Yan, Hao Gu, Jun Xue, Chenglong Wang, Zhao Lv, Xiaohui Zhang

    Abstract: Rapid advancements in speech synthesis and voice conversion bring convenience but also new security risks, creating an urgent need for effective audio deepfake detection. Although current models perform well, their effectiveness diminishes when confronted with the diverse and evolving nature of real-world deepfakes. To address this issue, we propose a continual learning method named Region-Based O…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  4. arXiv:2412.01425  [pdf, other]

    cs.SD cs.AI eess.AS

    Reject Threshold Adaptation for Open-Set Model Attribution of Deepfake Audio

    Authors: Xinrui Yan, Jiangyan Yi, Jianhua Tao, Yujie Chen, Hao Gu, Guanjun Li, Junzuo Zhou, Yong Ren, Tao Xu

    Abstract: Open environment oriented open set model attribution of deepfake audio is an emerging research topic, aiming to identify the generation models of deepfake audio. Most previous work requires manually setting a rejection threshold for unknown classes to compare with predicted probabilities. However, models often overfit training instances and generate overly confident predictions. Moreover, threshol…

    Submitted 2 December, 2024; originally announced December 2024.

    Comments: Accepted by ISCSLP 2024

  5. arXiv:2411.19514  [pdf, other]

    eess.IV cs.CV cs.LG

    Enhancing AI microscopy for foodborne bacterial classification via adversarial domain adaptation across optical and biological variability

    Authors: Siddhartha Bhattacharya, Aarham Wasit, Mason Earles, Nitin Nitin, Luyao Ma, Jiyoon Yi

    Abstract: Rapid detection of foodborne bacteria is critical for food safety and quality, yet traditional culture-based methods require extended incubation and specialized sample preparation. This study addresses these challenges by i) enhancing the generalizability of AI-enabled microscopy for bacterial classification using adversarial domain adaptation and ii) comparing the performance of single-target and…

    Submitted 29 November, 2024; originally announced November 2024.

  6. arXiv:2411.14493  [pdf]

    cs.CL cs.SD eess.AS

    From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language

    Authors: Muhammad Sharif, Zeeshan Abbas, Jiangyan Yi, Chenglin Liu

    Abstract: Automatic Speech Recognition (ASR) technology has witnessed significant advancements in recent years, revolutionizing human-computer interactions. While major languages have benefited from these developments, lesser-resourced languages like Urdu face unique challenges. This paper provides an extensive exploration of the dynamic landscape of ASR research, focusing particularly on the resource-const…

    Submitted 20 November, 2024; originally announced November 2024.

    Comments: Submitted to SN Computer Science

  7. arXiv:2410.21256  [pdf, other]

    cs.AI cs.CV eess.IV

    Multi-modal AI for comprehensive breast cancer prognostication

    Authors: Jan Witowski, Ken G. Zeng, Joseph Cappadona, Jailan Elayoubi, Khalil Choucair, Elena Diana Chiru, Nancy Chan, Young-Joon Kang, Frederick Howard, Irina Ostrovnaya, Carlos Fernandez-Granda, Freya Schnabel, Zoe Steinsnyder, Ugur Ozerdem, Kangning Liu, Waleed Abdulsattar, Yu Zong, Lina Daoud, Rafic Beydoun, Anas Saad, Nitya Thakore, Mohammad Sadic, Frank Yeung, Elisa Liu, Theodore Hill, et al. (26 additional authors not shown)

    Abstract: Treatment selection in breast cancer is guided by molecular subtypes and clinical characteristics. However, current tools including genomic assays lack the accuracy required for optimal clinical decision-making. We developed a novel artificial intelligence (AI)-based approach that integrates digital pathology images with clinical data, providing a more robust and effective method for predicting th…

    Submitted 2 March, 2025; v1 submitted 28 October, 2024; originally announced October 2024.

  8. arXiv:2409.12121  [pdf, other]

    cs.SD eess.AS

    WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification

    Authors: Junzuo Zhou, Jiangyan Yi, Yong Ren, Jianhua Tao, Tao Wang, Chu Yuan Zhang

    Abstract: Recent advances in speech spoofing necessitate stronger verification mechanisms in neural speech codecs to ensure authenticity. Current methods embed numerical watermarks before compression and extract them from reconstructed speech for verification, but face limitations such as separate training processes for the watermark and codec, and insufficient cross-modal information integration, leading t…

    Submitted 27 December, 2024; v1 submitted 18 September, 2024; originally announced September 2024.

  9. arXiv:2408.17009  [pdf, other]

    cs.SD eess.AS

    Utilizing Speaker Profiles for Impersonation Audio Detection

    Authors: Hao Gu, JiangYan Yi, Chenglong Wang, Yong Ren, Jianhua Tao, Xinrui Yan, Yujie Chen, Xiaohui Zhang

    Abstract: Fake audio detection is an emerging active topic. A growing number of literatures have aimed to detect fake utterance, which are mostly generated by Text-to-speech (TTS) or voice conversion (VC). However, countermeasures against impersonation remain an underexplored area. Impersonation is a fake type that involves an imitator replicating specific traits and speech style of a target speaker. Unlike…

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: Accepted by ACM MM 2024

  10. arXiv:2408.05758  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

    Authors: Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, Jianhua Tao

    Abstract: Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the spe…

    Submitted 27 May, 2025; v1 submitted 11 August, 2024; originally announced August 2024.

  11. arXiv:2408.04967  [pdf, other]

    eess.AS cs.SD

    ADD 2023: Towards Audio Deepfake Detection and Analysis in the Wild

    Authors: Jiangyan Yi, Chu Yuan Zhang, Jianhua Tao, Chenglong Wang, Xinrui Yan, Yong Ren, Hao Gu, Junzuo Zhou

    Abstract: The growing prominence of the field of audio deepfake detection is driven by its wide range of applications, notably in protecting the public from potential fraud and other malicious activities, prompting the need for greater attention and research in this area. The ADD 2023 challenge goes beyond binary real/fake classification by emulating real-world scenarios, such as the identification of manip…

    Submitted 11 December, 2024; v1 submitted 9 August, 2024; originally announced August 2024.

    Comments: This work has been submitted to the IEEE for possible publication

  12. arXiv:2407.21611  [pdf, other]

    cs.SD cs.AI eess.AS

    Enhancing Partially Spoofed Audio Localization with Boundary-aware Attention Mechanism

    Authors: Jiafeng Zhong, Bin Li, Jiangyan Yi

    Abstract: The task of partially spoofed audio localization aims to accurately determine audio authenticity at a frame level. Although some works have achieved encouraging results, utilizing boundary information within a single model remains an unexplored research topic. In this work, we propose a novel method called Boundary-aware Attention Mechanism (BAM). Specifically, it consists of two core modules: Bou…

    Submitted 19 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

    Comments: Accepted by Interspeech 2024

  13. arXiv:2407.10157  [pdf, other]

    eess.IV cs.CV

    SACNet: A Spatially Adaptive Convolution Network for 2D Multi-organ Medical Segmentation

    Authors: Lin Zhang, Wenbo Gao, Jie Yi, Yunyun Yang

    Abstract: Multi-organ segmentation in medical image analysis is crucial for diagnosis and treatment planning. However, many factors complicate the task, including variability in different target categories and interference from complex backgrounds. In this paper, we utilize the knowledge of Deformable Convolution V3 (DCNv3) and multi-object segmentation to optimize our Spatially Adaptive Convolution Network…

    Submitted 7 February, 2025; v1 submitted 14 July, 2024; originally announced July 2024.

    Comments: Accepted by BIBM 2024

  14. arXiv:2407.08239  [pdf, other]

    cs.SD cs.LG eess.AS

    An Unsupervised Domain Adaptation Method for Locating Manipulated Region in partially fake Audio

    Authors: Siding Zeng, Jiangyan Yi, Jianhua Tao, Yujie Chen, Shan Liang, Yong Ren, Xiaohui Zhang

    Abstract: When the task of locating manipulation regions in partially-fake audio (PFA) involves cross-domain datasets, the performance of deep learning models drops significantly due to the shift between the source and target domains. To address this issue, existing approaches often employ data augmentation before training. However, they overlook the characteristics in target domain that are absent in sourc…

    Submitted 11 July, 2024; originally announced July 2024.

  15. arXiv:2407.00888  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Papez: Resource-Efficient Speech Separation with Auditory Working Memory

    Authors: Hyunseok Oh, Juheon Yi, Youngki Lee

    Abstract: Transformer-based models recently reached state-of-the-art single-channel speech separation accuracy; However, their extreme computational load makes it difficult to deploy them in resource-constrained mobile or IoT devices. We thus present Papez, a lightweight and computation-efficient single-channel speech separation model. Papez is based on three key techniques. We first replace the inter-chunk…

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: 5 pages. Accepted by ICASSP 2023

  16. arXiv:2406.16200  [pdf, other]

    cs.LG cs.CR cs.IT eess.SP

    Towards unlocking the mystery of adversarial fragility of neural networks

    Authors: Jingchao Gao, Raghu Mudumbai, Xiaodong Wu, Jirong Yi, Catherine Xu, Hui Xie, Weiyu Xu

    Abstract: In this paper, we study the adversarial robustness of deep neural networks for classification tasks. We look at the smallest magnitude of possible additive perturbations that can change the output of a classification algorithm. We provide a matrix-theoretic explanation of the adversarial fragility of deep neural network for classification. In particular, our theoretical results show that neural ne…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: 21 pages

  17. arXiv:2406.09664  [pdf, other]

    cs.SD eess.AS

    Frequency-mix Knowledge Distillation for Fake Speech Detection

    Authors: Cunhang Fan, Shunbo Dong, Jun Xue, Yujie Chen, Jiangyan Yi, Zhao Lv

    Abstract: In the telephony scenarios, the fake speech detection (FSD) task to combat speech spoofing attacks is challenging. Data augmentation (DA) methods are considered effective means to address the FSD task in telephony scenarios, typically divided into time domain and frequency domain stages. While each has its advantages, both can result in information loss. To tackle this issue, we propose a novel DA…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  18. arXiv:2406.06086  [pdf, other]

    cs.SD eess.AS

    RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection

    Authors: Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Lv Zhao, Cunhang Fan

    Abstract: Fake artefacts for discriminating between bonafide and fake audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepf…

    Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  19. arXiv:2406.04840  [pdf, other]

    cs.SD eess.AS

    TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking

    Authors: Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen

    Abstract: Various threats posed by the progress in text-to-speech (TTS) have prompted the need to reliably trace synthesized speech. However, contemporary approaches to this task involve adding watermarks to the audio separately after generation, a process that hurts both speech quality and watermark imperceptibility. In addition, these approaches are limited in robustness and flexibility. To address these…

    Submitted 15 November, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  20. arXiv:2405.08596  [pdf, ps, other]

    cs.SD eess.AS

    Towards Robust Audio Deepfake Detection: A Evolving Benchmark for Continual Learning

    Authors: Xiaohui Zhang, Jiangyan Yi, Jianhua Tao

    Abstract: The rise of advanced large language models such as GPT-4, GPT-4o, and the Claude family has made fake audio detection increasingly challenging. Traditional fine-tuning methods struggle to keep pace with the evolving landscape of synthetic speech, necessitating continual learning approaches that can adapt to new audio while retaining the ability to detect older types. Continual learning, which acts…

    Submitted 13 August, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

  21. arXiv:2404.16346  [pdf, other]

    eess.IV cs.AI cs.CV

    Light-weight Retinal Layer Segmentation with Global Reasoning

    Authors: Xiang He, Weiye Song, Yiming Wang, Fabio Poiesi, Ji Yi, Manishi Desai, Quanqing Xu, Kongzheng Yang, Yi Wan

    Abstract: Automatic retinal layer segmentation with medical images, such as optical coherence tomography (OCT) images, serves as an important tool for diagnosing ophthalmic diseases. However, it is challenging to achieve accurate segmentation due to low contrast and blood flow noises presented in the images. In addition, the algorithm should be light-weight to be deployed for practical clinical applications…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: IEEE Transactions on Instrumentation & Measurement

  22. arXiv:2403.06072   

    cs.IT eess.SP

    Channel Estimation Considerate Precoder Design for Multi-user Massive MIMO-OFDM Systems: The Concept and Fast Algorithms

    Authors: Liu Junkai, Jiang Yi

    Abstract: The sixth-generation (6G) communication networks target peak data rates exceeding 1Tbps, necessitating base stations (BS) to support up to 100 simultaneous data streams. However, sparse pilot allocation to accommodate such streams poses challenges for users' channel estimation. This paper presents Channel Estimation Considerate Precoding (CECP), where BS precoders prioritize facilitating channel e…

    Submitted 7 April, 2024; v1 submitted 9 March, 2024; originally announced March 2024.

    Comments: The work is supported by HUAWEI cooperation, which is related to the current HUAWEI project. HUAWEI cooperation requires to withdraw the paper

  23. Complete and Near-Optimal Robotic Crack Coverage and Filling in Civil Infrastructure

    Authors: Vishnu Veeraraghavan, Kyle Hunte, Jingang Yi, Kaiyan Yu

    Abstract: We present a simultaneous sensor-based inspection and footprint coverage (SIFC) planning and control design with applications to autonomous robotic crack mapping and filling. The main challenge of the SIFC problem lies in the coupling of complete sensing (for mapping) and robotic footprint (for filling) coverage tasks. Initially, we assume known target information (e.g., cracks) and employ classic…

    Submitted 28 September, 2024; v1 submitted 1 March, 2024; originally announced March 2024.

    Journal ref: IEEE Transactions on Robotics, vol. 40, pp. 2850-2867, 2024

  24. arXiv:2402.10055  [pdf]

    eess.IV cs.AI cs.CV

    Robust semi-automatic vessel tracing in the human retinal image by an instance segmentation neural network

    Authors: Siyi Chen, Amir H. Kashani, Ji Yi

    Abstract: The morphology and hierarchy of the vascular systems are essential for perfusion in supporting metabolism. In human retina, one of the most energy-demanding organs, retinal circulation nourishes the entire inner retina by an intricate vasculature emerging and remerging at the optic nerve head (ONH). Thus, tracing the vascular branching from ONH through the vascular tree can illustrate vascular hie…

    Submitted 15 February, 2024; originally announced February 2024.

  25. arXiv:2401.03650  [pdf, other]

    eess.AS cs.SD eess.SP

    DDD: A Perceptually Superior Low-Response-Time DNN-based Declipper

    Authors: Jayeon Yi, Junghyun Koo, Kyogu Lee

    Abstract: Clipping is a common nonlinear distortion that occurs whenever the input or output of an audio system exceeds the supported range. This phenomenon undermines not only the perception of speech quality but also downstream processes utilizing the disrupted signal. Therefore, a real-time-capable, robust, and low-response-time method for speech declipping (SD) is desired. In this work, we introduce DDD…

    Submitted 7 January, 2024; originally announced January 2024.

    Comments: To appear, ICASSP 2024. Demo samples at https://stet-stet.github.io/DDD, repo at https://github.com/stet-stet/DDD

  26. arXiv:2401.03488  [pdf, other]

    cs.LG cs.CR eess.SP

    Data-Driven Subsampling in the Presence of an Adversarial Actor

    Authors: Abu Shafin Mohammad Mahdee Jameel, Ahmed P. Mohamed, Jinho Yi, Aly El Gamal, Akshay Malhotra

    Abstract: Deep learning based automatic modulation classification (AMC) has received significant attention owing to its potential applications in both military and civilian use cases. Recently, data-driven subsampling techniques have been utilized to overcome the challenges associated with computational complexity and training time for AMC. Beyond these direct advantages of data-driven subsampling, these me…

    Submitted 7 January, 2024; originally announced January 2024.

    Comments: Accepted for publication at ICMLCN 2024

  27. arXiv:2312.10155  [pdf, ps, other]

    cs.RO eess.SY

    Gaussian Process-Based Learning Control of Underactuated Balance Robots with an External and Internal Convertible Modeling Structure

    Authors: Feng Han, Jingang Yi

    Abstract: External and internal convertible (EIC) form-based motion control is one of the effective designs of simultaneously trajectory tracking and balance for underactuated balance robots. Under certain conditions, the EIC-based control design however leads to uncontrolled robot motion. We present a Gaussian process (GP)-based data-driven learning control for underactuated balance robots with the EIC mod…

    Submitted 15 December, 2023; originally announced December 2023.

  28. arXiv:2312.09651  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection

    Authors: Xiaohui Zhang, Jiangyan Yi, Chenglong Wang, Chuyuan Zhang, Siding Zeng, Jianhua Tao

    Abstract: The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms. Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. To address this challenge, one of the…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted to the main track of the 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)

  29. arXiv:2311.13687  [pdf, other]

    cs.LG cs.MM cs.SD eess.AS

    Beat-Aligned Spectrogram-to-Sequence Generation of Rhythm-Game Charts

    Authors: Jayeon Yi, Sungho Lee, Kyogu Lee

    Abstract: In the heart of "rhythm games" - games where players must perform actions in sync with a piece of music - are "charts", the directives to be given to players. We newly formulate chart generation as a sequence generation task and train a Transformer using a large dataset. We also introduce tempo-informed preprocessing and training procedures, some of which are suggested to be integral for a success…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: ISMIR 2023 LBD. Demo videos and code at stet-stet.github.io/goct

  30. arXiv:2311.07613   

    eess.SY cs.LG math.DS

    A Physics-informed Machine Learning-based Control Method for Nonlinear Dynamic Systems with Highly Noisy Measurements

    Authors: Mason Ma, Jiajie Wu, Chase Post, Tony Shi, Jingang Yi, Tony Schmitz, Hong Wang

    Abstract: This study presents a physics-informed machine learning-based control method for nonlinear dynamic systems with highly noisy measurements. Existing data-driven control methods that use machine learning for system identification cannot effectively cope with highly noisy measurements, resulting in unstable control performance. To address this challenge, the present study extends current physics-info…

    Submitted 22 March, 2025; v1 submitted 11 November, 2023; originally announced November 2023.

    Comments: We completely redesigned and rewrote this paper. It will be a completely different paper with different title, author list, and content

  31. arXiv:2310.09999  [pdf, other]

    stat.ML cs.LG eess.SP

    Outlier Detection Using Generative Models with Theoretical Performance Guarantees

    Authors: Jirong Yi, Jingchao Gao, Tianming Wang, Xiaodong Wu, Weiyu Xu

    Abstract: This paper considers the problem of recovering signals modeled by generative models from linear measurements contaminated with sparse outliers. We propose an outlier detection approach for reconstructing the ground-truth signals modeled by generative models under sparse outliers. We establish theoretical recovery guarantees for reconstruction of signals using generative models in the presence of o…

    Submitted 15 October, 2023; originally announced October 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:1810.11335

  32. Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection

    Authors: Cunhang Fan, Mingming Ding, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Zhao Lv

    Abstract: Most research in synthetic speech detection (SSD) focuses on improving performance on standard noise-free datasets. However, in actual situations, noise interference is usually present, causing significant performance degradation in SSD systems. To improve noise robustness, this paper proposes a dual-branch knowledge distillation synthetic speech detection (DKDSSD) method. Specifically, a parallel…

    Submitted 16 April, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

  33. arXiv:2310.04010  [pdf, other]

    cs.CV cs.AI eess.IV

    Excision And Recovery: Visual Defect Obfuscation Based Self-Supervised Anomaly Detection Strategy

    Authors: YeongHyeon Park, Sungho Kang, Myung Jin Kim, Yeonho Lee, Hyeong Seok Kim, Juneho Yi

    Abstract: Due to scarcity of anomaly situations in the early manufacturing stage, an unsupervised anomaly detection (UAD) approach is widely adopted which only uses normal samples for training. This approach is based on the assumption that the trained UAD model will accurately reconstruct normal patterns but struggles with unseen anomalous patterns. To enhance the UAD performance, reconstruction-by-inpainti…

    Submitted 9 November, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: 10 pages, 5 figures, 5 tables

  34. arXiv:2310.00014  [pdf, other]

    cs.SD eess.AS

    Fewer-token Neural Speech Codec with Time-invariant Codes

    Authors: Yong Ren, Tao Wang, Jiangyan Yi, Le Xu, Jianhua Tao, Chuyuan Zhang, Junzuo Zhou

    Abstract: Language model based text-to-speech (TTS) models, like VALL-E, have gained attention for their outstanding in-context learning capability in zero-shot scenarios. Neural speech codec is a critical component of these models, which can convert speech into discrete token representations. However, excessive token sequences from the codec may negatively affect prediction accuracy and restrict the progre…

    Submitted 10 March, 2024; v1 submitted 15 September, 2023; originally announced October 2023.

    Comments: Accepted by ICASSP 2024

  35. arXiv:2309.16720  [pdf, ps, other]

    cs.RO eess.SY

    Energy Efficient Foot-Shape Design for Bipedal Walkers on Granular Terrain

    Authors: Xunjie Chen, Jingang Yi, Hao Wang

    Abstract: It is important to understand how bipedal walkers balance and walk effectively on granular materials, such as sand and loose dirt, etc. This paper first presents a computational approach to obtain the motion and energy analysis of bipedal walkers on granular terrains and then discusses an optimization method for the robot foot-shape contour design for energy efficiently walking. We first present t…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: The 3rd Modeling, Estimation and Control Conference (MECC 2023), Lake Tahoe, NV, Oct 2-5 2023

  36. arXiv:2309.15784  [pdf, other]

    cs.RO eess.SY

    Gaussian Process-Enhanced, External and Internal Convertible (EIC) Form-Based Control of Underactuated Balance Robots

    Authors: Feng Han, Jingang Yi

    Abstract: External and internal convertible (EIC) form-based motion control (i.e., EIC-based control) is one of the effective approaches for underactuated balance robots. By sequentially controller design, trajectory tracking of the actuated subsystem and balance of the unactuated subsystem can be achieved simultaneously. However, with certain conditions, there exists uncontrolled robot motion under the EIC…

    Submitted 27 September, 2023; originally announced September 2023.

  37. arXiv:2309.08166  [pdf, other]

    cs.SD eess.AS

    Residual Speaker Representation for One-Shot Voice Conversion

    Authors: Le Xu, Jiangyan Yi, Tao Wang, Yong Ren, Rongxiu Zhong, Zhengqi Wen, Jianhua Tao

    Abstract: Recently, there have been significant advancements in voice conversion, resulting in high-quality performance. However, there are still two critical challenges in this field. Firstly, current voice conversion methods have limited robustness when encountering unseen speakers. Secondly, they also have limited ability to control timbre representation. To address these challenges, this paper presents…

    Submitted 11 August, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted by INTERSPEECH 2024

  38. arXiv:2309.07147  [pdf, other]

    eess.SP cs.HC cs.LG cs.MM cs.SD eess.AS

    DGSD: Dynamical Graph Self-Distillation for EEG-Based Auditory Spatial Attention Detection

    Authors: Cunhang Fan, Hongyu Zhang, Wei Huang, Jun Xue, Jianhua Tao, Jiangyan Yi, Zhao Lv, Xiaopei Wu

    Abstract: Auditory Attention Detection (AAD) aims to detect target speaker from brain signals in a multi-speaker environment. Although EEG-based AAD methods have shown promising results in recent years, current approaches primarily rely on traditional convolutional neural network designed for processing Euclidean data like images. This makes it challenging to handle EEG signals, which possess non-Euclidean…

    Submitted 7 September, 2023; originally announced September 2023.

  39. arXiv:2309.06780  [pdf, other]

    cs.SD eess.AS

    Distinguishing Neural Speech Synthesis Models Through Fingerprints in Speech Waveforms

    Authors: Chu Yuan Zhang, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Xinrui Yan

    Abstract: Recent strides in neural speech synthesis technologies, while enjoying widespread applications, have nonetheless introduced a series of challenges, spurring interest in the defence against the threat of misuse and abuse. Notably, source attribution of synthesized speech has value in forensics and intellectual property protection, but prior work in this area has certain limitations in scope. To add…

    Submitted 15 June, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: Accepted by CCL 2024

  40. arXiv:2308.14970  [pdf, other]

    cs.SD eess.AS

    Audio Deepfake Detection: A Survey

    Authors: Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, Yan Zhao

    Abstract: Audio deepfake detection is an emerging active topic. A growing number of literatures have aimed to study deepfake detection algorithms and achieved effective performance, the problem of which is far from being solved. Although there are some review literatures, there has been no comprehensive survey that provides researchers with a systematic overview of these developments with a unified evaluati…

    Submitted 28 August, 2023; originally announced August 2023.

  41. arXiv:2308.14595  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV

    Neural Network Training Strategy to Enhance Anomaly Detection Performance: A Perspective on Reconstruction Loss Amplification

    Authors: YeongHyeon Park, Sungho Kang, Myung Jin Kim, Hyeonho Jeong, Hyunkyu Park, Hyeong Seok Kim, Juneho Yi

    Abstract: Unsupervised anomaly detection (UAD) is a widely adopted approach in industry due to rare anomaly occurrences and data imbalance. A desirable characteristic of an UAD model is contained generalization ability which excels in the reconstruction of seen normal patterns but struggles with unseen anomalies. Recent studies have pursued to contain the generalization capability of their UAD models in rec…

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: 5 pages, 4 figures, 2 tables

  42. Spatial Reconstructed Local Attention Res2Net with F0 Subband for Fake Speech Detection

    Authors: Cunhang Fan, Jun Xue, Jianhua Tao, Jiangyan Yi, Chenglong Wang, Chengshi Zheng, Zhao Lv

    Abstract: The rhythm of bonafide speech is often difficult to replicate, which causes that the fundamental frequency (F0) of synthetic speech is significantly different from that of real speech. It is expected that the F0 feature contains the discriminative information for the fake speech detection (FSD) task. In this paper, we propose a novel F0 subband for FSD. In addition, to effectively model the F0 sub…

    Submitted 8 July, 2024; v1 submitted 19 August, 2023; originally announced August 2023.

    Comments: Accepted by Neural Networks

  43. arXiv:2308.03300  [pdf, other

    cs.SD cs.LG eess.AS

    Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection

    Authors: Xiaohui Zhang, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Chuyuan Zhang

    Abstract: Current fake audio detection algorithms have achieved promising performances on most datasets. However, their performance may be significantly degraded when dealing with audio of a different dataset. The orthogonal weight modification to overcome catastrophic forgetting does not consider the similarity of genuine audio across different datasets. To overcome this limitation, we propose a continual…

    Submitted 7 August, 2023; originally announced August 2023.

    Comments: 40th International Conference on Machine Learning (ICML 2023)

  44. arXiv:2307.08323  [pdf, other

    cs.SD eess.AS

    TST: Time-Sparse Transducer for Automatic Speech Recognition

    Authors: Xiaohui Zhang, Mangui Liang, Zhengkun Tian, Jiangyan Yi, Jianhua Tao

    Abstract: End-to-end model, especially Recurrent Neural Network Transducer (RNN-T), has achieved great success in speech recognition. However, transducer requires a great memory footprint and computing time when processing a long decoding sequence. To solve this problem, we propose a model named time-sparse transducer, which introduces a time-sparse mechanism into transducer. In this mechanism, we obtain th…

    Submitted 17 July, 2023; originally announced July 2023.

    Comments: 10 pages

    Journal ref: International Conference on Artificial Intelligence (CICAI 2023)

  45. arXiv:2306.05617  [pdf, other

    cs.SD cs.CL eess.AS

    Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection

    Authors: Chenglong Wang, Jiangyan Yi, Xiaohui Zhang, Jianhua Tao, Le Xu, Ruibo Fu

    Abstract: Self-supervised speech models are a rapidly developing research topic in fake audio detection. Many pre-trained models can serve as feature extractors, learning richer and higher-level speech features. However, when fine-tuning pre-trained models, there is often a challenge of excessively long training times and high memory consumption, and complete fine-tuning is also very expensive. To alleviate…

    Submitted 8 June, 2023; originally announced June 2023.

    Comments: 6 pages

    Journal ref: IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis

  46. arXiv:2306.04956  [pdf, other

    cs.SD cs.LG eess.AS

    Adaptive Fake Audio Detection with Low-Rank Model Squeezing

    Authors: Xiaohui Zhang, Jiangyan Yi, Jianhua Tao, Chenlong Wang, Le Xu, Ruibo Fu

    Abstract: The rapid advancement of spoofing algorithms necessitates the development of robust detection methods capable of accurately identifying emerging fake audio. Traditional approaches, such as finetuning on new datasets containing these novel spoofing algorithms, are computationally intensive and pose a risk of impairing the acquired knowledge of known fake audio types. To address these challenges, th…

    Submitted 8 June, 2023; originally announced June 2023.

    Journal ref: DADA workshop on IJCAI 2023

  47. arXiv:2305.13774  [pdf, other

    cs.SD eess.AS

    ADD 2023: the Second Audio Deepfake Detection Challenge

    Authors: Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, Le Xu, Junzuo Zhou, Hao Gu, Zhengqi Wen, Shan Liang, Zheng Lian, Shuai Nie, Haizhou Li

    Abstract: Audio deepfake detection is an emerging topic in the artificial intelligence community. The second Audio Deepfake Detection Challenge (ADD 2023) aims to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analyzing deepfake speech utterances. Different from previous challenges (e.g. ADD 2022), ADD 2023 focuses on s…

    Submitted 23 May, 2023; originally announced May 2023.

  48. arXiv:2305.13701  [pdf, other

    cs.SD eess.AS

    TO-Rawnet: Improving RawNet with TCN and Orthogonal Regularization for Fake Audio Detection

    Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Chuyuan Zhang, Shuai Zhang, Ruibo Fu, Xun Chen

    Abstract: Current fake audio detection relies on hand-crafted features, which lose information during extraction. To overcome this, recent studies use direct feature extraction from raw audio signals. For example, RawNet is one of the representative works in end-to-end fake audio detection. However, existing work on RawNet does not optimize the parameters of the Sinc-conv during training, which limited its…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  49. arXiv:2305.13700  [pdf, other

    cs.SD eess.AS

    Detection of Cross-Dataset Fake Audio Based on Prosodic and Pronunciation Features

    Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Chuyuan Zhang, Shuai Zhang, Xun Chen

    Abstract: Existing fake audio detection systems perform well in in-domain testing, but still face many challenges in out-of-domain testing. This is due to the mismatch between the training and test data, as well as the poor generalizability of features extracted from limited views. To address this, we propose multi-view features for fake audio detection, which aim to capture more generalized features from p…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  50. arXiv:2303.01211  [pdf, other

    cs.SD cs.LG cs.MM eess.AS

    Learning From Yourself: A Self-Distillation Method for Fake Speech Detection

    Authors: Jun Xue, Cunhang Fan, Jiangyan Yi, Chenglong Wang, Zhengqi Wen, Dan Zhang, Zhao Lv

    Abstract: In this paper, we propose a novel self-distillation method for fake speech detection (FSD), which can significantly improve the performance of FSD without increasing the model complexity. For FSD, some fine-grained information is very important, such as spectrogram defects, mute segments, and so on, which are often perceived by shallow networks. However, shallow networks have much noise, which can…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP 2023