
Showing 1–50 of 383 results for author: Tian, J

Searching in archive cs.
  1. arXiv:2507.05568  [pdf, ps, other]

    cs.CV cs.LG

    ReLayout: Integrating Relation Reasoning for Content-aware Layout Generation with Multi-modal Large Language Models

    Authors: Jiaxu Tian, Xuehui Yu, Yaoxing Wang, Pan Wang, Guangqian Guo, Shan Gao

    Abstract: Content-aware layout aims to arrange design elements appropriately on a given canvas to convey information effectively. Recently, the trend for this task has been to leverage large language models (LLMs) to generate layouts automatically, achieving remarkable performance. However, existing LLM-based methods fail to adequately interpret spatial relationships among visual themes and design elements,…

    Submitted 7 July, 2025; originally announced July 2025.

  2. arXiv:2507.01224  [pdf, ps, other]

    cs.DC

    FLARE: A Dataflow-Aware and Scalable Hardware Architecture for Neural-Hybrid Scientific Lossy Compression

    Authors: Wenqi Jia, Ying Huang, Jian Xu, Zhewen Hu, Sian Jin, Jiannan Tian, Yuede Ji, Miao Yin

    Abstract: Scientific simulation leveraging high-performance computing (HPC) systems is crucial for modeling complex systems and phenomena in fields such as astrophysics, climate science, and fluid dynamics, generating massive datasets that often reach petabyte to exabyte scales. However, managing these vast data volumes introduces significant I/O and network bottlenecks, limiting practical performance and s…

    Submitted 1 July, 2025; originally announced July 2025.

  3. arXiv:2507.00356  [pdf]

    cs.CV cs.AI

    CGEarthEye: A High-Resolution Remote Sensing Vision Foundation Model Based on the Jilin-1 Satellite Constellation

    Authors: Zhiwei Yi, Xin Cheng, Jingyu Ma, Ruifei Zhu, Junwei Tian, Yuanxiu Zhou, Xinge Zhao, Hongzhe Li

    Abstract: Deep learning methods have significantly advanced the development of intelligent interpretation in remote sensing (RS), with foundational model research based on large-scale pre-training paradigms rapidly reshaping various domains of Earth Observation (EO). However, compared to the open accessibility and high spatiotemporal coverage of medium-resolution data, the limited acquisition channels for…

    Submitted 30 June, 2025; originally announced July 2025.

    Comments: A Remote Sensing Foundation Model for Very High Resolution Images

  4. arXiv:2506.23520  [pdf, ps, other]

    cs.AI

    ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data

    Authors: Yu Zhang, Ruijie Yu, Jidong Tian, Feng Zhu, Jiapeng Liu, Xiaokang Yang, Yaohui Jin, Yanyan Xu

    Abstract: With the increasing interest in robotic synthesis in the context of organic chemistry, the automated extraction of chemical procedures from literature is critical. However, this task remains challenging due to the inherent ambiguity of chemical language and the high cost of human annotation required for developing reliable computer-aided extraction protocols. Here, we present ChemActor, a fully fi…

    Submitted 1 July, 2025; v1 submitted 30 June, 2025; originally announced June 2025.

  5. arXiv:2506.17611  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    OpusLM: A Family of Open Unified Speech Language Models

    Authors: Jinchuan Tian, William Chen, Yifan Peng, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Huck Yang, Shinji Watanabe

    Abstract: This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) up to 7B. Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. We demonstrate our OpusLMs achieve comparable (or even superior) performance with existing SpeechLMs in…

    Submitted 21 June, 2025; originally announced June 2025.

  6. arXiv:2506.16201  [pdf, ps, other]

    cs.RO cs.CV

    FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation

    Authors: Sen Wang, Le Wang, Sanping Zhou, Jingyi Tian, Jiayi Li, Haowen Sun, Wei Tang

    Abstract: Robotic manipulation in high-precision tasks is essential for numerous industrial and real-world applications where accuracy and speed are required. Yet current diffusion-based policy learning methods generally suffer from low computational efficiency due to the iterative denoising process during inference. Moreover, these methods do not fully explore the potential of generative models for enhanci…

    Submitted 19 June, 2025; originally announced June 2025.

  7. arXiv:2506.13585  [pdf, ps, other]

    cs.CL cs.LG

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Authors: MiniMax: Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, et al. (103 additional authors not shown)

    Abstract: We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: A technical report from MiniMax. The authors are listed in alphabetical order. We open-source our MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1

  8. arXiv:2506.09663  [pdf, ps, other]

    cs.CV

    Self-Supervised Multi-Part Articulated Objects Modeling via Deformable Gaussian Splatting and Progressive Primitive Segmentation

    Authors: Haowen Wang, Xiaoping Yuan, Zhao Jin, Zhen Zhao, Zhengping Che, Yousong Xue, Jin Tian, Yakun Huang, Jian Tang

    Abstract: Articulated objects are ubiquitous in everyday life, and accurate 3D representations of their geometry and motion are critical for numerous applications. However, in the absence of human annotation, existing approaches still struggle to build a unified representation for objects that contain multiple movable parts. We introduce DeGSS, a unified framework that encodes articulated objects as deforma…

    Submitted 11 June, 2025; originally announced June 2025.

  9. arXiv:2506.05902  [pdf, ps, other]

    cs.LG physics.soc-ph

    A Driving Regime-Embedded Deep Learning Framework for Modeling Intra-Driver Heterogeneity in Multi-Scale Car-Following Dynamics

    Authors: Shirui Zhou, Jiying Yan, Junfang Tian, Tao Wang, Yongfu Li, Shiquan Zhong

    Abstract: A fundamental challenge in car-following modeling lies in accurately representing the multi-scale complexity of driving behaviors, particularly the intra-driver heterogeneity where a single driver's actions fluctuate dynamically under varying conditions. While existing models, both conventional and data-driven, address behavioral heterogeneity to some extent, they often emphasize inter-driver hete…

    Submitted 6 June, 2025; originally announced June 2025.

  10. arXiv:2506.05767  [pdf, ps, other]

    cs.CL cs.AI

    dots.llm1 Technical Report

    Authors: Bi Huo, Bin Tu, Cheng Qin, Da Zheng, Debing Zhang, Dongjie Zhang, En Li, Fu Guo, Jian Yao, Jie Lou, Junfeng Tian, Li Hu, Ran Zhu, Shengdong Chen, Shuo Liu, Su Guang, Te Wo, Weijun Zhang, Xiaoming Shi, Xinxin Peng, Xing Wu, Yawen Liu, Yuqiu Ji, Ze Wen, Zhenhai Liu, et al. (2 additional authors not shown)

    Abstract: Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference cos…

    Submitted 6 June, 2025; originally announced June 2025.

  11. arXiv:2506.04941  [pdf, ps, other]

    cs.RO

    ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning

    Authors: Zhao Jin, Zhengping Che, Zhen Zhao, Kun Wu, Yuheng Zhang, Yinuo Zhao, Zehui Liu, Qiang Zhang, Xiaozhu Ju, Jing Tian, Yousong Xue, Jian Tang

    Abstract: Robot learning increasingly relies on simulation to advance complex abilities such as dexterous manipulation and precise interaction, necessitating high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated-object datasets for simulation are limited by insufficient visual realism and low physical fidelity, which hinder their utility for training models mas…

    Submitted 5 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

  12. arXiv:2506.01049  [pdf, ps, other]

    cs.LG cs.AI

    Taming LLMs by Scaling Learning Rates with Gradient Grouping

    Authors: Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu

    Abstract: Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rate estimation, resulting in training instability, slow convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) tech…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: Preprint version of "Taming LLMs with Gradient Grouping" (ACL'2025). The code will be available at https://github.com/ScalingOpt/SGG

  13. arXiv:2506.00722  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

    Authors: Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

    Abstract: Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-th…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: Accepted at INTERSPEECH 2025

  14. arXiv:2506.00338  [pdf, other]

    cs.CL cs.SD eess.AS

    OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning

    Authors: Yifan Peng, Shakeel Muhammad, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe

    Abstract: The Open Whisper-style Speech Models (OWSM) project has developed a series of fully open speech foundation models using academic-scale resources, but their training data remains insufficient. This work enhances OWSM by integrating YODAS, a large-scale web-crawled dataset with a Creative Commons license. However, incorporating YODAS is nontrivial due to its wild nature, which introduces challenges…

    Submitted 30 May, 2025; originally announced June 2025.

    Comments: Accepted at INTERSPEECH 2025

  15. arXiv:2505.24518  [pdf, ps, other]

    cs.SD cs.MM eess.AS

    ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation

    Authors: Jiatong Shi, Yifan Cheng, Bo-Hao Su, Hye-jin Shim, Jinchuan Tian, Samuele Cornell, Yiwen Zhao, Siddhant Arora, Shinji Watanabe

    Abstract: Speech signal analysis poses significant challenges, particularly in tasks such as speech quality evaluation and profiling, where the goal is to predict multiple perceptual and objective metrics. For instance, metrics like PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score) each capture different aspects of speech quality. Howev…

    Submitted 30 May, 2025; originally announced May 2025.

  16. arXiv:2505.23966  [pdf, ps, other]

    cs.CL

    FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression

    Authors: Jiayi Tian, Ryan Solgi, Jinming Lu, Yifan Yang, Hai Li, Zheng Zhang

    Abstract: Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation, expensive calibration procedures, and result i…

    Submitted 24 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

  17. arXiv:2505.20357  [pdf, ps, other]

    cs.LG gr-qc physics.data-an

    Learning and Interpreting Gravitational-Wave Features from CNNs with a Random Forest Approach

    Authors: Jun Tian, He Wang, Jibo He, Yu Pan, Shuo Cao, Qingquan Jiang

    Abstract: Convolutional neural networks (CNNs) have become widely adopted in gravitational wave (GW) detection pipelines due to their ability to automatically learn hierarchical features from raw strain data. However, the physical meaning of these learned features remains underexplored, limiting the interpretability of such models. In this work, we propose a hybrid architecture that combines a CNN-based fea…

    Submitted 26 May, 2025; originally announced May 2025.

  18. arXiv:2505.19437  [pdf, ps, other]

    cs.SD eess.AS

    RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval

    Authors: Haoqin Sun, Jingguang Tian, Jiaming Zhou, Hui Wang, Jiabei He, Shiwan Zhao, Xiangyu Kong, Desheng Hu, Xinkang Xu, Xinhui Hu, Yong Qin

    Abstract: The Contrastive Language-Audio Pretraining (CLAP) model has demonstrated excellent performance in general audio description-related tasks, such as audio retrieval. However, in the emerging field of emotional speaking style description (ESSD), cross-modal contrastive pretraining remains largely unexplored. In this paper, we propose a novel speech retrieval task called emotional speaking style retri…

    Submitted 25 May, 2025; originally announced May 2025.

  19. arXiv:2505.19425  [pdf, ps, other]

    cs.CV cs.CR cs.LG

    Structure Disruption: Subverting Malicious Diffusion-Based Inpainting via Self-Attention Query Perturbation

    Authors: Yuhao He, Jinyu Tian, Haiwei Wu, Jianqing Li

    Abstract: The rapid advancement of diffusion models has enhanced their image inpainting and editing capabilities but also introduced significant societal risks. Adversaries can exploit user images from social media to generate misleading or harmful content. While adversarial perturbations can disrupt inpainting, global perturbation-based methods fail in mask-guided editing tasks due to spatial constraints.…

    Submitted 25 May, 2025; originally announced May 2025.

  20. arXiv:2505.18985  [pdf, ps, other]

    cs.LG cs.CL cs.CV

    STRICT: Stress Test of Rendering Images Containing Text

    Authors: Tianyu Zhang, Xinyu Wang, Zhenghan Tai, Lu Li, Jijun Chi, Jingrui Tian, Hailin He, Suyuchen Wang

    Abstract: While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: 13 pages

  21. arXiv:2505.18730  [pdf, other]

    cs.CV

    Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation

    Authors: Wenchao Zhang, Jiahe Tian, Runze He, Jizhong Han, Jiao Dai, Miaomiao Feng, Wei Mi, Xiaodan Zhang

    Abstract: Recent text-to-image (T2I) generation models have advanced significantly, enabling the creation of high-fidelity images from textual prompts. However, existing evaluation benchmarks primarily focus on the explicit alignment between generated images and prompts, neglecting the alignment with real-world knowledge beyond prompts. To address this gap, we introduce Align Beyond Prompts (ABP), a compreh…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: Code: https://github.com/smile365317/ABP

  22. arXiv:2505.18498  [pdf, other]

    cs.SD eess.AS

    Learning Emotion-Invariant Speaker Representations for Speaker Verification

    Authors: Jingguang Tian, Xinhui Hu, Xinkang Xu

    Abstract: In recent years, the rapid progress in speaker verification (SV) technology has been driven by the extraction of speaker representations based on deep learning. However, such representations are still vulnerable to emotion variability. To address this issue, we propose multiple improvements to train speaker encoders to increase emotion robustness. Firstly, we utilize CopyPaste-based data augmentat…

    Submitted 24 May, 2025; originally announced May 2025.

  23. arXiv:2505.14989  [pdf, other]

    cs.SD eess.AS

    Discrete Audio Representations for Automated Audio Captioning

    Authors: Jingguang Tian, Haoqin Sun, Xinhui Hu, Xinkang Xu

    Abstract: Discrete audio representations, termed audio tokens, are broadly categorized into semantic and acoustic tokens, typically generated through unsupervised tokenization of continuous audio representations. However, their applicability to automated audio captioning (AAC) remains underexplored. This paper systematically investigates the viability of audio token-driven models for AAC through comparative…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: Interspeech 2025

  24. arXiv:2505.13843  [pdf, ps, other]

    eess.AS cs.SD

    A Semantic Information-based Hierarchical Speech Enhancement Method Using Factorized Codec and Diffusion Model

    Authors: Yang Xiang, Canan Huang, Desheng Hu, Jingguang Tian, Xinhui Hu, Chao Zhang

    Abstract: Most current speech enhancement (SE) methods recover clean speech from noisy inputs by directly estimating time-frequency masks or spectrums. However, these approaches often neglect the distinct attributes, such as semantic content and acoustic details, inherent in speech signals, which can hinder performance in downstream tasks. Moreover, their effectiveness tends to degrade in complex acoustic e…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  25. arXiv:2505.12398  [pdf, other]

    cs.CL cs.AI cs.LG

    Traversal Verification for Speculative Tree Decoding

    Authors: Yepeng Weng, Qiao Hu, Xujie Chen, Li Liu, Dianwen Mei, Huishi Qiu, Jiang Tian, Zhongchao Shi

    Abstract: Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in parallel to determine whether the drafted tokens should be accepted or rejected. To enhance acceptance rates, existing frameworks typically construct token tre…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: Under review

  26. arXiv:2505.04983  [pdf, ps, other]

    stat.ME cs.AI

    Decomposition of Probabilities of Causation with Two Mediators

    Authors: Yuta Kawakami, Jin Tian

    Abstract: Mediation analysis for probabilities of causation (PoC) provides a fundamental framework for evaluating the necessity and sufficiency of treatment in provoking an event through different causal pathways. One of the primary objectives of causal mediation analysis is to decompose the total effect into path-specific components. In this study, we investigate the path-specific probability of necessity…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: arXiv admin note: text overlap with arXiv:2412.14491

  27. arXiv:2505.04971  [pdf, ps, other]

    stat.ME cs.AI

    Moments of Causal Effects

    Authors: Yuta Kawakami, Jin Tian

    Abstract: The moments of random variables are fundamental statistical measures for characterizing the shape of a probability distribution, encompassing metrics such as mean, variance, skewness, and kurtosis. Additionally, the product moments, including covariance and correlation, reveal the relationships between multiple random variables. On the other hand, the primary focus of causal inference is the evalu…

    Submitted 8 May, 2025; originally announced May 2025.

  28. arXiv:2504.21385  [pdf, other]

    cs.CV

    IDDM: Bridging Synthetic-to-Real Domain Gap from Physics-Guided Diffusion for Real-world Image Dehazing

    Authors: Shijun Zhou, Yajing Liu, Chunhui Hao, Zhiyuan Liu, Jiandong Tian

    Abstract: Due to the domain gap between real-world and synthetic hazy images, current data-driven dehazing algorithms trained on synthetic datasets perform well on synthetic data but struggle to generalize to real-world scenarios. To address this challenge, we propose Image Dehazing Diffusion Models (IDDM), a novel diffusion process that incorporates the atmospheric scatt…

    Submitted 30 April, 2025; originally announced April 2025.

  29. arXiv:2504.17789  [pdf, other]

    cs.CV

    Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

    Authors: Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu

    Abstract: Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than Diffusion-based models. A primary limitation is the substantial number of image tokens required for AR models, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a n…

    Submitted 27 April, 2025; v1 submitted 24 April, 2025; originally announced April 2025.

    Comments: Project Page: https://ma-xu.github.io/token-shuffle/ Add related works

  30. arXiv:2504.14493  [pdf, ps, other]

    cs.IR cs.AI cs.LG

    FinSage: A Multi-aspect RAG System for Financial Filings Question Answering

    Authors: Xinyu Wang, Jijun Chi, Zhenghan Tai, Tung Sum Thomas Kwok, Muzhi Li, Zhuhong Li, Hailin He, Yuchen Hua, Peng Lu, Suyuchen Wang, Yihong Wu, Jerry Huang, Jingrui Tian, Fengran Mo, Yufei Cui, Ling Zhou

    Abstract: Leveraging large language models in real-world settings often entails a need to utilize domain-specific data and tools in order to follow the complex regulations that need to be followed for acceptable use. Within financial sectors, modern enterprises increasingly rely on Retrieval-Augmented Generation (RAG) systems to address complex compliance requirements in financial document workflows. Howeve…

    Submitted 6 June, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

  31. arXiv:2504.12737  [pdf, other]

    cs.CL

    Chinese-Vicuna: A Chinese Instruction-following Llama-based Model

    Authors: Chenghao Fan, Zhenyi Lu, Jie Tian

    Abstract: Chinese-Vicuna is an open-source, resource-efficient language model designed to bridge the gap in Chinese instruction-following capabilities by fine-tuning Meta's LLaMA architecture using Low-Rank Adaptation (LoRA). Targeting low-resource environments, it enables cost-effective deployment on consumer GPUs (e.g., RTX-2080Ti for 7B models) and supports domain-specific adaptation in fields like healt…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Chinese-Vicuna Technique Report

  32. arXiv:2504.06474  [pdf, other]

    cs.AR

    FETTA: Flexible and Efficient Hardware Accelerator for Tensorized Neural Network Training

    Authors: Jinming Lu, Jiayi Tian, Hai Li, Ian Young, Zheng Zhang

    Abstract: The increasing demand for on-device training of deep neural networks (DNNs) aims to leverage personal data for high-performance applications while addressing privacy concerns and reducing communication latency. However, resource-constrained platforms face significant challenges due to the intensive computational and memory demands of DNN training. Tensor decomposition emerges as a promising approa…

    Submitted 8 April, 2025; originally announced April 2025.

  33. arXiv:2504.05692  [pdf, other]

    eess.IV cs.CV

    POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction

    Authors: Songyan Zhang, Yongtao Ge, Jinyuan Tian, Guangkai Xu, Hao Chen, Chen Lv, Chunhua Shen

    Abstract: 3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules where the latter task is pivotal for distinguishing dynamic regions which can help to mitigate the interference introduced by camera and object motion. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion und…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: code: https://github.com/wyddmw/POMATO

  34. arXiv:2504.04374  [pdf, other]

    cs.CR cs.AI

    iADCPS: Time Series Anomaly Detection for Evolving Cyber-physical Systems via Incremental Meta-learning

    Authors: Jiyu Tian, Mingchu Li, Liming Chen, Zumin Wang

    Abstract: Anomaly detection for cyber-physical systems (ADCPS) is crucial in identifying faults and potential attacks by analyzing the time series of sensor measurements and actuator states. However, current methods lack adaptation to data distribution shifts in both temporal and spatial dimensions as cyber-physical systems evolve. To tackle this issue, we propose an incremental meta-learning-based approach…

    Submitted 6 April, 2025; originally announced April 2025.

  35. arXiv:2504.01561  [pdf, other]

    eess.IV cs.CV

    STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation

    Authors: Dandan Shan, Zihan Li, Yunxiang Li, Qingde Li, Jie Tian, Qingqi Hong

    Abstract: Accurate segmentation of lesions plays a critical role in medical image analysis and diagnosis. Traditional segmentation approaches that rely solely on visual features often struggle with the inherent uncertainty in lesion distribution and size. To address these issues, we propose STPNet, a Scale-aware Text Prompt Network that leverages vision-language modeling to enhance medical image segmentatio…

    Submitted 2 April, 2025; originally announced April 2025.

  36. arXiv:2504.00999  [pdf, other]

    cs.CV cs.AI

    MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

    Authors: Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, Zhen Lei

    Abstract: Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to address the trade-off in shared latent space for generation quality vs. representation learning and efficiency. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: CVPR2025 (in process for more analysis and extension)

  37. arXiv:2503.22205  [pdf, other]

    cs.LG cs.CV

    Data-Free Universal Attack by Exploiting the Intrinsic Vulnerability of Deep Models

    Authors: YangTian Yan, Jinyu Tian

    Abstract: Deep neural networks (DNNs) are susceptible to Universal Adversarial Perturbations (UAPs), which are instance agnostic perturbations that can deceive a target model across a wide range of samples. Unlike instance-specific adversarial examples, UAPs present a greater challenge as they must generalize across different samples and models. Generating UAPs typically requires access to numerous examples…

    Submitted 28 March, 2025; originally announced March 2025.

    Comments: Accepted in AAAI 2025

  38. arXiv:2503.20322  [pdf, other]

    cs.CV

    Dynamic Pyramid Network for Efficient Multimodal Large Language Model

    Authors: Hao Ai, Kunyi Wang, Zezhou Wang, Hao Lu, Jin Tian, Yaxin Luo, Peng Xing, Jen-Yuan Huang, Huaxia Li, Gen Luo

    Abstract: Multimodal large language models (MLLMs) have demonstrated impressive performance in various vision-language (VL) tasks, but their expensive computations still limit the real-world application. To address this issue, recent efforts aim to compress the visual features to save the computational costs of MLLMs. However, direct visual compression methods, e.g. efficient projectors, inevitably destroy…

    Submitted 24 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

  39. arXiv:2503.20031  [pdf, other]

    astro-ph.IM cs.CE

    Lossy Compression of Scientific Data: Applications Constrains and Requirements

    Authors: Franck Cappello, Allison Baker, Ebru Bozda, Martin Burtscher, Kyle Chard, Sheng Di, Paul Christopher O Grady, Peng Jiang, Shaomeng Li, Erik Lindahl, Peter Lindstrom, Magnus Lundborg, Kai Zhao, Xin Liang, Masaru Nagaso, Kento Sato, Amarjit Singh, Seung Woo Son, Dingwen Tao, Jiannan Tian, Robert Underwood, Kazutomo Yoshii, Danylo Lykov, Yuri Alexeev, Kyle Gerard Felker

    Abstract: Increasing data volumes from scientific simulations and instruments (supercomputers, accelerators, telescopes) often exceed network, storage, and analysis capabilities. The scientific community's response to this challenge is scientific data reduction. Reduction can take many forms, such as triggering, sampling, filtering, quantization, and dimensionality reduction. This report focuses on a specif…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: 33 pages

  40. arXiv:2503.19889  [pdf]

    cond-mat.mtrl-sci cs.RO

    A Multi-Agent Framework Integrating Large Language Models and Generative AI for Accelerated Metamaterial Design

    Authors: Jie Tian, Martin Taylor Sobczak, Dhanush Patil, Jixin Hou, Lin Pang, Arunachalam Ramanathan, Libin Yang, Xianyan Chen, Yuval Golan, Xiaoming Zhai, Hongyue Sun, Kenan Song, Xianqiao Wang

    Abstract: Metamaterials, renowned for their exceptional mechanical, electromagnetic, and thermal properties, hold transformative potential across diverse applications, yet their design remains constrained by labor-intensive trial-and-error methods and limited data interoperability. Here, we introduce CrossMatAgent -- a novel multi-agent framework that synergistically integrates large language models with st…

    Submitted 6 April, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

  41. arXiv:2503.17407  [pdf, other]

    cs.CL cs.LG

    A Comprehensive Survey on Long Context Language Modeling

    Authors: Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, et al. (12 additional authors not shown)

    Abstract: Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-c…

    Submitted 20 March, 2025; originally announced March 2025.

  42. arXiv:2503.13435  [pdf, other]

    cs.CV

    WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes

    Authors: Ling Yang, Kaixin Zhu, Juanxi Tian, Bohan Zeng, Mingbao Lin, Hongjuan Pei, Wentao Zhang, Shuicheng Yan

    Abstract: With the rapid development of 3D reconstruction technology, research in 4D reconstruction is also advancing, and existing 4D reconstruction methods can generate high-quality 4D scenes. However, due to the challenges in acquiring multi-view video data, the current 4D reconstruction benchmarks mainly display actions performed in place, such as dancing, within limited scenarios. In practical scenarios, m…

    Submitted 29 April, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

    Comments: Project: https://github.com/Gen-Verse/WideRange4D

  43. arXiv:2503.09294  [pdf, other]

    cs.CV

    IQPFR: An Image Quality Prior for Blind Face Restoration and Beyond

    Authors: Peng Hu, Chunming He, Lei Xu, Jingduo Tian, Sina Farsiu, Yulun Zhang, Pei Liu, Xiu Li

    Abstract: Blind Face Restoration (BFR) addresses the challenge of reconstructing degraded low-quality (LQ) facial images into high-quality (HQ) outputs. Conventional approaches predominantly rely on learning feature representations from ground-truth (GT) data; however, inherent imperfections in GT datasets constrain restoration performance to the mean quality level of the training data, rather than attainin…

    Submitted 12 March, 2025; originally announced March 2025.

  44. arXiv:2503.08533  [pdf, other]

    cs.CL cs.SD eess.AS

    ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems

    Authors: Siddhant Arora, Yifan Peng, Jiatong Shi, Jinchuan Tian, William Chen, Shikhar Bharadwaj, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shuichiro Shimizu, Vaibhav Srivastav, Shinji Watanabe

    Abstract: Advancements in audio foundation models (FMs) have fueled interest in end-to-end (E2E) spoken dialogue systems, but different web interfaces for each system make it challenging to compare and contrast them effectively. Motivated by this, we introduce an open-source, user-friendly toolkit designed to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Our demo furthe…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: Accepted at NAACL 2025 Demo Track

  45. arXiv:2503.05808  [pdf, other]

    cs.AI cs.LG cs.RO

    DriveGen: Towards Infinite Diverse Traffic Scenarios with Large Models

    Authors: Shenyu Zhang, Jiaguo Tian, Zhengbang Zhu, Shan Huang, Jucheng Yang, Weinan Zhang

    Abstract: Microscopic traffic simulation has become an important tool for autonomous driving training and testing. Although recent data-driven approaches advance realistic behavior generation, their learning still relies primarily on a single real-world dataset, which limits their diversity and thereby hinders downstream algorithm optimization. In this paper, we propose DriveGen, a novel traffic simulation…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: 8 pages, 3 figures

  46. arXiv:2503.05595  [pdf, other]

    cs.CV

    Anti-Diffusion: Preventing Abuse of Modifications of Diffusion-Based Models

    Authors: Zheng Li, Liangbin Xie, Jiantao Zhou, Xintao Wang, Haiwei Wu, Jinyu Tian

    Abstract: Although diffusion-based techniques have shown remarkable success in image generation and editing tasks, their abuse can lead to severe negative social impacts. Recently, some works have been proposed to provide defense against the abuse of diffusion-based methods. However, their protection may be limited in specific scenarios by manually defined prompts or the stable diffusion (SD) version. Furth…

    Submitted 7 March, 2025; originally announced March 2025.

  47. arXiv:2503.03528  [pdf, other]

    cs.CV cs.AI

    AdaSin: Enhancing Hard Sample Metrics with Dual Adaptive Penalty for Face Recognition

    Authors: Qiqi Guo, Zhuowen Zheng, Guanghua Yang, Zhiquan Liu, Xiaofan Li, Jianqing Li, Jinyu Tian, Xueyuan Gong

    Abstract: In recent years, the emergence of deep convolutional neural networks has positioned face recognition as a prominent research focus in computer vision. Traditional loss functions, such as margin-based, hard-sample mining-based, and hybrid approaches, have achieved notable performance improvements, with some leveraging curriculum learning to optimize training. However, these methods often fall short…

    Submitted 5 March, 2025; originally announced March 2025.

  48. arXiv:2503.02332  [pdf, other]

    eess.IV cs.CV

    COMMA: Coordinate-aware Modulated Mamba Network for 3D Dispersed Vessel Segmentation

    Authors: Gen Shi, Hui Zhang, Jie Tian

    Abstract: Accurate segmentation of 3D vascular structures is essential for various medical imaging applications. The dispersed nature of vascular structures leads to inherent spatial uncertainty and necessitates location awareness, yet most current 3D medical segmentation models rely on the patch-wise training strategy that usually loses this spatial context. In this study, we introduce the Coordinate-aware…

    Submitted 14 March, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

  49. arXiv:2503.00948  [pdf, other]

    cs.CV

    Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think

    Authors: Jie Tian, Xiaoye Qu, Zhenyi Lu, Wei Wei, Sichen Liu, Yu Cheng

    Abstract: Image-to-Video (I2V) generation aims to synthesize a video clip according to a given image and condition (e.g., text). The key challenge of this task lies in simultaneously generating natural motions while preserving the original appearance of the images. However, current I2V diffusion models (I2V-DMs) often produce videos with limited motion degrees or exhibit uncontrollable motion that conflicts…

    Submitted 2 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR2025

    MSC Class: 68T45; ACM Class: I.2.10

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  50. arXiv:2502.16880  [pdf, other]

    cs.CL cs.AI cs.LG

    CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter

    Authors: Yepeng Weng, Dianwen Mei, Huishi Qiu, Xujie Chen, Li Liu, Jiang Tian, Zhongchao Shi

    Abstract: Speculative decoding is a powerful technique that accelerates Large Language Model (LLM) inference by leveraging a lightweight speculative draft model. However, existing designs suffer in performance due to misalignment between training and inference. Recent methods have tried to solve this issue by adopting a multi-step training strategy, but the complex inputs of different training steps make i…

    Submitted 25 May, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: Accepted to ACL 2025 main conference