
Showing 1–50 of 618 results for author: Jiang, S

Searching in archive cs.
  1. arXiv:2505.10010  [pdf, ps, other]

    cs.LG

    ImagineBench: Evaluating Reinforcement Learning with Large Language Model Rollouts

    Authors: Jing-Cheng Pang, Kaiyuan Li, Yidi Wang, Si-Hang Yang, Shengyi Jiang, Yang Yu

    Abstract: A central challenge in reinforcement learning (RL) is its dependence on extensive real-world interaction data to learn task-specific policies. While recent work demonstrates that large language models (LLMs) can mitigate this limitation by generating synthetic experience (noted as imaginary rollouts) for mastering novel tasks, progress in this emerging field is hindered due to the lack of a standa…

    Submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2505.07916  [pdf, ps, other]

    eess.AS cs.SD

    MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

    Authors: Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He

    Abstract: We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, w…

    Submitted 12 May, 2025; originally announced May 2025.

  3. arXiv:2505.07455  [pdf, ps, other]

    cs.RO

    GelFusion: Enhancing Robotic Manipulation under Visual Constraints via Visuotactile Fusion

    Authors: Shulong Jiang, Shiqi Zhao, Yuxuan Fan, Peng Yin

    Abstract: Visuotactile sensing offers rich contact information that can help mitigate performance bottlenecks in imitation learning, particularly under vision-limited conditions, such as ambiguous visual cues or occlusions. Effectively fusing visual and visuotactile modalities, however, presents ongoing challenges. We introduce GelFusion, a framework designed to enhance policies by integrating visuotactile…

    Submitted 12 May, 2025; originally announced May 2025.

  4. arXiv:2505.04921  [pdf, other]

    cs.CV cs.CL

    Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

    Authors: Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang

    Abstract: Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integra…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: 75 pages, 10 figures; Project: https://github.com/HITsz-TMG/Awesome-Large-Multimodal-Reasoning-Models

  5. arXiv:2505.02159  [pdf, other]

    cs.CV

    Small Clips, Big Gains: Learning Long-Range Refocused Temporal Information for Video Super-Resolution

    Authors: Xingyu Zhou, Wei Long, Jingbo Lu, Shiyin Jiang, Weiyi You, Haifeng Wu, Shuhang Gu

    Abstract: Video super-resolution (VSR) can achieve better performance compared to single image super-resolution by additionally leveraging temporal information. In particular, the recurrent-based VSR model exploits long-range temporal information during inference and achieves superior detail restoration. However, effectively learning these long-term dependencies within long videos remains a key challenge. T…

    Submitted 4 May, 2025; originally announced May 2025.

    Comments: 15 pages, 11 figures

  6. arXiv:2505.01743  [pdf, other]

    cs.CV cs.AI cs.LG

    An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding

    Authors: Siyang Jiang, Bufang Yang, Lilin Xu, Mu Yuan, Yeerzhati Abudunuer, Kaiwei Liu, Liekang Zeng, Hongkai Chen, Zhenyu Yan, Xiaofan Jiang, Guoliang Xing

    Abstract: The rapid advancements in Large Vision Language Models (LVLMs) offer the potential to surpass conventional labeling by generating richer, more detailed descriptions of on-device human behavior understanding (HBU) in low-resolution vision systems, such as depth, thermal, and infrared. However, existing large vision language model (LVLM) approaches are unable to understand low-resolution data well a…

    Submitted 3 May, 2025; originally announced May 2025.

  7. arXiv:2505.01433  [pdf, other]

    q-bio.QM cs.CL cs.LG

    Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations

    Authors: Cong Qi, Hanzhang Fang, Siqi Jiang, Tianxing Hu, Wei Zhi

    Abstract: Understanding the binding specificity between T-cell receptors (TCRs) and peptide-major histocompatibility complexes (pMHCs) is central to immunotherapy and vaccine development. However, current predictive models struggle with generalization, especially in data-scarce settings and when faced with novel epitopes. We present LANTERN (Large lAnguage model-powered TCR-Enhanced Recognition Network), a…

    Submitted 22 April, 2025; originally announced May 2025.

  8. arXiv:2505.00742  [pdf, other]

    cs.CV cs.AI eess.IV

    Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

    Authors: Jiaxu Qian, Chendong Wang, Yifan Yang, Chaoyun Zhang, Huiqiang Jiang, Xufang Luo, Yu Kang, Qingwei Lin, Anlan Zhang, Shiqi Jiang, Ting Cao, Tianjun Mao, Suman Banerjee, Guyue Liu, Saravan Rajmohan, Dongmei Zhang, Yuqing Yang, Qi Zhang, Lili Qiu

    Abstract: Recent advancements in multimodal large language models (MLLMs) have broadened the scope of vision-language tasks, excelling in applications like image captioning and interactive question-answering. However, these models struggle with accurately processing visual data, particularly in tasks requiring precise object recognition and fine visual details. Stringent token limits often result in the omi…

    Submitted 29 April, 2025; originally announced May 2025.

  9. arXiv:2505.00308  [pdf]

    cs.CV cs.AI stat.AP

    AI-Assisted Decision-Making for Clinical Assessment of Auto-Segmented Contour Quality

    Authors: Biling Wang, Austen Maniscalco, Ti Bai, Siqiu Wang, Michael Dohopolski, Mu-Han Lin, Chenyang Shen, Dan Nguyen, Junzhou Huang, Steve Jiang, Xinlei Wang

    Abstract: Purpose: This study presents a Deep Learning (DL)-based quality assessment (QA) approach for evaluating auto-generated contours (auto-contours) in radiotherapy, with emphasis on Online Adaptive Radiotherapy (OART). Leveraging Bayesian Ordinal Classification (BOC) and calibrated uncertainty thresholds, the method enables confident QA predictions without relying on ground truth contours or extensive…

    Submitted 11 May, 2025; v1 submitted 1 May, 2025; originally announced May 2025.

  10. arXiv:2505.00254  [pdf, ps, other]

    cs.CV cs.AI

    Empowering Agentic Video Analytics Systems with Video Language Models

    Authors: Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu

    Abstract: AI-driven video analytics has become increasingly pivotal across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Video-Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and an…

    Submitted 1 May, 2025; v1 submitted 30 April, 2025; originally announced May 2025.

    Comments: 15 pages, AVAS

  11. arXiv:2504.20383  [pdf, other]

    cs.CV eess.IV

    Neural Stereo Video Compression with Hybrid Disparity Compensation

    Authors: Shiyin Jiang, Zhenghao Chen, Minghao Han, Xingyu Zhou, Leheng Zhang, Shuhang Gu

    Abstract: Disparity compensation represents the primary strategy in stereo video compression (SVC) for exploiting cross-view redundancy. These mechanisms can be broadly categorized into two types: one that employs explicit horizontal shifting, and another that utilizes an implicit cross-attention mechanism to reduce cross-view disparity redundancy. In this work, we propose a hybrid disparity compensation (H…

    Submitted 28 April, 2025; originally announced April 2025.

  12. arXiv:2504.16956  [pdf, other]

    cs.CL cs.LG q-bio.GN

    Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity

    Authors: Cong Qi, Hanzhang Fang, Tianxing Hu, Siqi Jiang, Wei Zhi

    Abstract: Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but its complexity, marked by high dimensionality, sparsity, and batch effects, poses major computational challenges. Transformer-based models have made significant advances in this domain but are often limited by their quadratic complexity and suboptimal handling of long-range depende…

    Submitted 22 April, 2025; originally announced April 2025.

  13. arXiv:2504.16382  [pdf, other]

    cs.DS cs.DC

    Fully Scalable MPC Algorithms for Euclidean k-Center

    Authors: Artur Czumaj, Guichen Gao, Mohsen Ghaffari, Shaofeng H. -C. Jiang

    Abstract: The $k$-center problem is a fundamental optimization problem with numerous applications in machine learning, data analysis, data mining, and communication networks. The $k$-center problem has been extensively studied in the classical sequential setting for several decades, and more recently there have been some efforts in understanding the problem in parallel computing, on the Massively Parallel C…

    Submitted 24 April, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

  14. arXiv:2504.14692  [pdf, other]

    cs.CL

    OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding

    Authors: Songtao Jiang, Yuan Wang, Sibo Song, Yan Zhang, Zijie Meng, Bohan Lei, Jian Wu, Jimeng Sun, Zuozhu Liu

    Abstract: The practical deployment of medical vision-language models (Med-VLMs) necessitates seamless integration of textual data with diverse visual modalities, including 2D/3D images and videos, yet existing models typically employ separate encoders for different modalities. To address this limitation, we present OmniV-Med, a unified framework for multimodal medical understanding. Our technical contributi…

    Submitted 20 April, 2025; originally announced April 2025.

  15. arXiv:2504.14286  [pdf, other]

    cs.LG

    SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM

    Authors: Xiaojiang Zhang, Jinghui Wang, Zifei Cheng, Wenhao Zhuang, Zheng Lin, Minglei Zhang, Shaojie Wang, Yinghan Cui, Chao Wang, Junyi Peng, Shimiao Jiang, Shiqi Kuang, Shouyu Yin, Chaohang Wen, Haotian Zhang, Bin Chen, Bing Yu

    Abstract: Recent advances of reasoning models, exemplified by OpenAI's o1 and DeepSeek's R1, highlight the significant potential of Reinforcement Learning (RL) to enhance the reasoning capabilities of Large Language Models (LLMs). However, replicating these advancements across diverse domains remains challenging due to limited methodological transparency. In this work, we present two-Staged history-Resampli…

    Submitted 22 April, 2025; v1 submitted 19 April, 2025; originally announced April 2025.

  16. arXiv:2504.13754  [pdf, other]

    cs.CV cs.AI

    Towards Accurate and Interpretable Neuroblastoma Diagnosis via Contrastive Multi-scale Pathological Image Analysis

    Authors: Zhu Zhu, Shuo Jiang, Jingyuan Zheng, Yawen Li, Yifei Chen, Manli Zhao, Weizhong Gu, Feiwei Qin, Jinhu Wang, Gang Yu

    Abstract: Neuroblastoma, adrenal-derived, is among the most common pediatric solid malignancies, characterized by significant clinical heterogeneity. Timely and accurate pathological diagnosis from hematoxylin and eosin-stained whole-slide images is critical for patient prognosis. However, current diagnostic practices primarily rely on subjective manual examination by pathologists, leading to inconsistent a…

    Submitted 6 May, 2025; v1 submitted 18 April, 2025; originally announced April 2025.

    Comments: 10 pages, 8 figures

  17. arXiv:2504.13405  [pdf, other]

    cs.CV

    ProgRoCC: A Progressive Approach to Rough Crowd Counting

    Authors: Shengqin Jiang, Linfei Li, Haokui Zhang, Qingshan Liu, Amin Beheshti, Jian Yang, Anton van den Hengel, Quan Z. Sheng, Yuankai Qi

    Abstract: As the number of individuals in a crowd grows, enumeration-based techniques become increasingly infeasible and their estimates increasingly unreliable. We propose instead an estimation-based version of the problem, which we label Rough Crowd Counting, that delivers better accuracy on the basis of training data that is easier to acquire. Rough crowd counting requires only rough annotations of the number o…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Under review

  18. arXiv:2504.11367  [pdf, other]

    physics.soc-ph cs.CL

    Network Alignment

    Authors: Rui Tang, Ziyun Yong, Shuyu Jiang, Xingshu Chen, Yaofang Liu, Yi-Cheng Zhang, Gui-Quan Sun, Wei Wang

    Abstract: Complex networks are frequently employed to model physical or virtual complex systems. When certain entities exist across multiple systems simultaneously, unveiling their corresponding relationships across the networks becomes crucial. This problem, known as network alignment, holds significant importance. It enhances our understanding of complex system structures and behaviours, facilitates the v…

    Submitted 15 April, 2025; originally announced April 2025.

    Journal ref: Physics Reports 1107 (2025): 1-45

  19. arXiv:2504.09704  [pdf, other]

    cs.LG cs.AI

    Transformer-Based Representation Learning for Robust Gene Expression Modeling and Cancer Prognosis

    Authors: Shuai Jiang, Saeed Hassanpour

    Abstract: Transformer-based models have achieved remarkable success in natural language and vision tasks, but their application to gene expression analysis remains limited due to data sparsity, high dimensionality, and missing values. We present GexBERT, a transformer-based autoencoder framework for robust representation learning of gene expression data. GexBERT learns context-aware gene embeddings by pretr…

    Submitted 13 April, 2025; originally announced April 2025.

  20. arXiv:2504.08378  [pdf, other]

    cs.LG

    Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

    Authors: Fucheng Jia, Zewen Wu, Shiqi Jiang, Huiqiang Jiang, Qianxi Zhang, Yuqing Yang, Yunxin Liu, Ju Ren, Deyu Zhang, Ting Cao

    Abstract: Large language models (LLMs) are increasingly being deployed on mobile devices, but the limited DRAM capacity constrains the deployable model size. This paper introduces ActiveFlow, the first LLM inference framework that can achieve adaptive DRAM usage for modern LLMs (not ReLU-based), enabling the scaling up of deployable model sizes. The framework is based on the novel concept of active weight D…

    Submitted 11 April, 2025; originally announced April 2025.

  21. arXiv:2504.08296  [pdf, other]

    cs.CV

    Generative AI for Film Creation: A Survey of Recent Advances

    Authors: Ruihan Zhang, Borou Yu, Jiajian Min, Yetong Xin, Zheng Wei, Juncheng Nemo Shi, Mingzhen Huang, Xianghao Kong, Nix Liu Xin, Shanshan Jiang, Praagya Bahuguna, Mark Chan, Khushi Hora, Lijian Yang, Yongqi Liang, Runhe Bian, Yunlei Liu, Isabela Campillo Valencia, Patricia Morales Tredinick, Ilia Kozlov, Sijia Jiang, Peiwen Huang, Na Chen, Xuanxuan Liu, Anyi Rao

    Abstract: Generative AI (GenAI) is transforming filmmaking, equipping artists with tools like text-to-image and image-to-video diffusion, neural radiance fields, avatar generation, and 3D synthesis. This paper examines the adoption of these technologies in filmmaking, analyzing workflows from recent AI-driven films to understand how GenAI contributes to character creation, aesthetic styling, and narration.…

    Submitted 11 April, 2025; originally announced April 2025.

    Comments: Accepted at CVPR 2025 CVEU workshop: AI for Creative Visual Content Generation Editing and Understanding

  22. arXiv:2504.04770  [pdf, other]

    cs.LG cs.AI q-bio.MN

    Bidirectional Hierarchical Protein Multi-Modal Representation Learning

    Authors: Xuefeng Liu, Songhao Jiang, Chih-chan Tien, Jinbo Xu, Rick Stevens

    Abstract: Protein representation learning is critical for numerous biological tasks. Recently, large transformer-based protein language models (pLMs) pretrained on large scale protein sequences have demonstrated significant success in sequence-based tasks. However, pLMs lack structural information. Conversely, graph neural networks (GNNs) designed to leverage 3D structural information have shown promising g…

    Submitted 7 April, 2025; originally announced April 2025.

  23. arXiv:2504.03513  [pdf, ps, other]

    cs.DS

    Local Search for Clustering in Almost-linear Time

    Authors: Shaofeng H. -C. Jiang, Yaonan Jin, Jianing Lou, Pinyan Lu

    Abstract: We propose the first \emph{local search} algorithm for Euclidean clustering that attains an $O(1)$-approximation in almost-linear time. Specifically, for Euclidean $k$-Means, our algorithm achieves an $O(c)$-approximation in $\tilde{O}(n^{1 + 1 / c})$ time, for any constant $c \ge 1$, maintaining the same running time as the previous (non-local-search-based) approach [la Tour and Saulpic, arXiv'24…

    Submitted 4 April, 2025; originally announced April 2025.

  24. arXiv:2504.00526  [pdf, other]

    cs.CV cs.AI

    High-Quality Pseudo-Label Generation Based on Visual Prompt Assisted Cloud Model Update

    Authors: Xinrun Xu, Qiuhong Zhang, Jianwen Yang, Zhanbiao Lian, Jin Yan, Zhiming Ding, Shan Jiang

    Abstract: Generating high-quality pseudo-labels on the cloud is crucial for cloud-edge object detection, especially in dynamic traffic monitoring where data distributions evolve. Existing methods often assume reliable cloud models, neglecting potential errors or struggling with complex distribution shifts. This paper proposes Cloud-Adaptive High-Quality Pseudo-label generation (CA-HQP), addressing these lim…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: IJCNN'25

  25. arXiv:2503.17096  [pdf, other]

    cs.CV

    Multi-modal Multi-platform Person Re-Identification: Benchmark and Method

    Authors: Ruiyang Ha, Songyi Jiang, Bin Li, Bikang Pan, Yihang Zhu, Junjie Zhang, Xiatian Zhu, Shaogang Gong, Jingya Wang

    Abstract: Conventional person re-identification (ReID) research is often limited to single-modality sensor data from static cameras, which fails to address the complexities of real-world scenarios where multi-modal signals are increasingly prevalent. For instance, consider an urban ReID system integrating stationary RGB cameras, nighttime infrared sensors, and UAVs equipped with dynamic tracking capabilitie…

    Submitted 23 March, 2025; v1 submitted 21 March, 2025; originally announced March 2025.

  26. arXiv:2503.16547  [pdf, other]

    cs.AI cs.MA

    Empowering Medical Multi-Agents with Clinical Consultation Flow for Dynamic Diagnosis

    Authors: Sihan Wang, Suiyang Jiang, Yibo Gao, Boming Wang, Shangqi Gao, Xiahai Zhuang

    Abstract: Traditional AI-based healthcare systems often rely on single-modal data, limiting diagnostic accuracy due to incomplete information. However, recent advancements in foundation models show promising potential for enhancing diagnosis by combining multi-modal information. While these models excel in static tasks, they struggle with dynamic diagnosis, failing to manage multi-turn interactions and often m…

    Submitted 19 March, 2025; originally announced March 2025.

  27. arXiv:2503.16542  [pdf, other]

    cs.CV cs.LG

    Defending Against Gradient Inversion Attacks for Biomedical Images via Learnable Data Perturbation

    Authors: Shiyi Jiang, Farshad Firouzi, Krishnendu Chakrabarty

    Abstract: The increasing need for sharing healthcare data and collaborating on clinical research has raised privacy concerns. Health information leakage due to malicious attacks can lead to serious problems such as misdiagnoses and patient identification issues. Privacy-preserving machine learning (PPML) and privacy-enhancing technologies, particularly federated learning (FL), have emerged in recent years a…

    Submitted 18 March, 2025; originally announced March 2025.

  28. arXiv:2503.16539  [pdf, other]

    math.OC cs.RO

    A Digital Twin Simulator of a Pastillation Process with Applications to Automatic Control based on Computer Vision

    Authors: Leonardo D. González, Joshua L. Pulsipher, Shengli Jiang, Tyler Soderstrom, Victor M. Zavala

    Abstract: We present a digital-twin simulator for a pastillation process. The simulation framework produces realistic thermal image data of the process that is used to train computer vision-based soft sensors based on convolutional neural networks (CNNs); the soft sensors produce output signals for temperature and product flow rate that enable real-time monitoring and feedback control. Pastillation technolo…

    Submitted 18 March, 2025; originally announced March 2025.

  29. arXiv:2503.15937  [pdf, other]

    cs.AI

    Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment

    Authors: Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu

    Abstract: We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that utilize Large Language Models (LLMs) as generators to directly generate actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this novel paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: t…

    Submitted 20 March, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

    Comments: 14 pages, 4 iterations, refine figs

  30. arXiv:2503.15478  [pdf, other]

    cs.LG

    SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

    Authors: Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li

    Abstract: Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench,…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: 29 pages, 16 figures

  31. arXiv:2503.15301  [pdf, other]

    cs.SE

    aiXcoder-7B-v2: Training LLMs to Fully Utilize the Long Context in Repository-level Code Completion

    Authors: Jia Li, Hao Zhu, Huanyu Liu, Xianjie Shi, He Zong, Yihong Dong, Kechi Zhang, Siyuan Jiang, Zhi Jin, Ge Li

    Abstract: Repository-level code completion aims to complete code based on the long contexts of the repository. Existing studies extract long contexts from the repository as inputs and leverage Large Language Models (LLMs) to generate code. However, we reveal a severe limitation of LLMs, i.e., LLMs may ignore the information within long contexts in code completion. In other words, even if the contexts contain u…

    Submitted 19 March, 2025; originally announced March 2025.

  32. arXiv:2503.11646  [pdf, other]

    cs.RO

    Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning

    Authors: Siyuan Huang, Yue Liao, Siyuan Feng, Shu Jiang, Si Liu, Hongsheng Li, Maoqing Yao, Guanghui Ren

    Abstract: The pursuit of data efficiency, where quality outweighs quantity, has emerged as a cornerstone in robotic manipulation, especially given the high costs associated with real-world data collection. We propose that maximizing the informational density of individual demonstrations can dramatically reduce reliance on large-scale datasets while improving task performance. To this end, we introduce Adver…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: More information can be found on our project page: https://sites.google.com/view/adc-robot

  33. arXiv:2503.09951  [pdf, other]

    cs.CV

    Target-aware Bidirectional Fusion Transformer for Aerial Object Tracking

    Authors: Xinglong Sun, Haijiang Sun, Shan Jiang, Jiacheng Wang, Jiasong Wang

    Abstract: The trackers based on lightweight neural networks have achieved great success in the field of aerial remote sensing, most of which aggregate multi-stage deep features to lift the tracking quality. However, existing algorithms usually only generate single-stage fusion features for state decision, which ignore that diverse kinds of features are required for identifying and locating the object, limit…

    Submitted 12 March, 2025; originally announced March 2025.

  34. arXiv:2503.08750  [pdf, other]

    cs.CL cs.AI cs.IR

    Exposing Product Bias in LLM Investment Recommendation

    Authors: Yuhan Zhi, Xiaoyu Zhang, Longtian Wang, Shumin Jiang, Shiqing Ma, Xiaohong Guan, Chao Shen

    Abstract: Large language models (LLMs), as a new generation of recommendation engines, possess powerful summarization and data analysis capabilities, surpassing traditional recommendation systems in both scope and performance. One promising application is investment recommendation. In this paper, we reveal a novel product bias in LLM investment recommendation, where LLMs exhibit systematic preferences for s…

    Submitted 11 March, 2025; originally announced March 2025.

  35. arXiv:2503.08162  [pdf, other]

    cs.RO cs.CL

    FASIONAD++ : Integrating High-Level Instruction and Information Bottleneck in FAt-Slow fusION Systems for Enhanced Safety in Autonomous Driving with Adaptive Feedback

    Authors: Kangan Qian, Ziang Luo, Sicong Jiang, Zilin Huang, Jinyu Miao, Zhikun Ma, Tianze Zhu, Jiayin Li, Yangfan He, Zheng Fu, Yining Shi, Boyue Wang, Hezhe Lin, Ziyu Chen, Jiangbo Yu, Xinyu Jiao, Mengmeng Yang, Kun Jiang, Diange Yang

    Abstract: Ensuring safe, comfortable, and efficient planning is crucial for autonomous driving systems. While end-to-end models trained on large datasets perform well in standard driving scenarios, they struggle with complex low-frequency events. Recent advancements in Large Language Models (LLMs) and Vision Language Models (VLMs) offer enhanced reasoning but suffer from computational inefficiency. Inspired by…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: 8 pages, 4 figures

  36. arXiv:2503.06669  [pdf, other]

    cs.RO cs.CV cs.LG

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Authors: AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, et al. (27 additional authors not shown)

    Abstract: We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets. Accelerated by a standardized collection pipeline with human-in-the-loo…

    Submitted 30 April, 2025; v1 submitted 9 March, 2025; originally announced March 2025.

    Comments: Project website: https://agibot-world.com/. Github repo: https://github.com/OpenDriveLab/AgiBot-World. The author list is ordered alphabetically by surname, with detailed contributions provided in the appendix

  37. arXiv:2503.06220  [pdf, other]

    cs.CV cs.LG

    StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

    Authors: Xin Ding, Hao Wu, Yifan Yang, Shiqi Jiang, Donglin Bai, Zhibo Chen, Ting Cao

    Abstract: With the rise of real-world human-AI interaction applications, such as AI assistants, the need for Streaming Video Dialogue is critical. To address this need, we introduce StreamMind, a video LLM framework that achieves ultra-FPS streaming video processing (100 fps on a single A100) and enables proactive, always-on responses in real time, without explicit user intervention. To solve the key chal…

    Submitted 28 March, 2025; v1 submitted 8 March, 2025; originally announced March 2025.

  38. arXiv:2503.05231  [pdf, other]

    cs.RO cs.AI

    Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction

    Authors: Shuo Jiang, Haonan Li, Ruochen Ren, Yanmin Zhou, Zhipeng Wang, Bin He

    Abstract: Cutting-edge robot learning techniques including foundation models and imitation learning from humans all pose huge demands on large-scale and high-quality datasets which constitute one of the bottlenecks in the general intelligent robot fields. This paper presents the Kaiwu multimodal dataset to address the missing real-world synchronized multimodal data problems in the sophisticated assembling sc…

    Submitted 7 March, 2025; originally announced March 2025.

  39. arXiv:2503.05173  [pdf, other]

    cs.DS

    Fair Clustering in the Sliding Window Model

    Authors: Vincent Cohen-Addad, Shaofeng H. -C. Jiang, Qiaoyuan Yang, Yubo Zhang, Samson Zhou

    Abstract: We study streaming algorithms for proportionally fair clustering, a notion originally suggested by Chierichetti et al. (2017), in the sliding window model. We show that although there exist efficient streaming algorithms in the insertion-only model, surprisingly no algorithm can achieve finite multiplicative ratio without violating the fairness constraint in the sliding window. Hence, the problem…

    Submitted 7 March, 2025; originally announced March 2025.

    Comments: ICLR 2025

  40. arXiv:2502.17761  [pdf, other]

    cs.CV stat.AP

    AI-driven 3D Spatial Transcriptomics

    Authors: Cristina Almagro-Pérez, Andrew H. Song, Luca Weishaupt, Ahrong Kim, Guillaume Jaume, Drew F. K. Williamson, Konstantin Hemker, Ming Y. Lu, Kritika Singh, Bowen Chen, Long Phi Le, Alexander S. Baras, Sizun Jiang, Ali Bashashati, Jonathan T. C. Liu, Faisal Mahmood

    Abstract: A comprehensive three-dimensional (3D) map of tissue architecture and gene expression is crucial for illuminating the complexity and heterogeneity of tissues across diverse biomedical applications. However, most spatial transcriptomics (ST) approaches remain limited to two-dimensional (2D) sections of tissue. Although current 3D ST methods hold promise, they typically require extensive tissue sect…

    Submitted 24 February, 2025; originally announced February 2025.

  41. arXiv:2502.17494  [pdf, other]

    cs.IR cs.AI cs.LG

    External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads Recommendation

    Authors: Mingfu Liang, Xi Liu, Rong Jin, Boyang Liu, Qiuling Suo, Qinghai Zhou, Song Zhou, Laming Chen, Hua Zheng, Zhiyuan Li, Shali Jiang, Jiyan Yang, Xiaozhen Xia, Fan Yang, Yasmine Badr, Ellie Wen, Shuyu Xu, Hansey Chen, Zhengyu Zhang, Jade Nie, Chunzhi Yang, Zhichen Zeng, Weilin Zhang, Xingliang Huang, Qianru Li, et al. (80 additional authors not shown)

    Abstract: Ads recommendation is a prominent service of online advertising systems and has been actively studied. Recent studies indicate that scaling up the recommendation model and improving its design can bring significant performance improvements. However, at larger model scales, such studies diverge increasingly from industrial practice, as they often neglect two fundamental challenges in indus…

    Submitted 23 April, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

    Comments: Accepted by the ACM Web Conference (WWW) 2025 Industrial Track as Oral Presentation

  42. arXiv:2502.16033  [pdf, other]

    cs.CL cs.AI

    Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models

    Authors: Qianqi Yan, Yue Fan, Hongquan Li, Shan Jiang, Yang Zhao, Xinze Guan, Ching-Chen Kuo, Xin Eric Wang

    Abstract: Existing Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs, leaving open the question of whether they can handle inconsistencies in real-world, layout-rich content. To bridge this gap, we propose the Multimodal Inconsistency Reasoning (MMIR) benchmark to assess MLLMs' ability to detect and reason about semantic mismatches in artifacts…

    Submitted 4 March, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

  43. arXiv:2502.15441  [pdf, other]

    cs.SE

    On the Effectiveness of Large Language Models in Writing Alloy Formulas

    Authors: Yang Hong, Shan Jiang, Yulei Fu, Sarfraz Khurshid

    Abstract: Declarative specifications have a vital role to play in developing safe and dependable software systems. Writing specifications correctly, however, remains particularly challenging. This paper presents a controlled experiment on using large language models (LLMs) to write declarative formulas in the well-known language Alloy. Our use of LLMs is three-fold. One, we employ LLMs to write complete All…

    Submitted 21 February, 2025; originally announced February 2025.

  44. arXiv:2502.15246  [pdf, other]

    cs.SE

    An approach for API synthesis using large language models

    Authors: Hua Zhong, Shan Jiang, Sarfraz Khurshid

    Abstract: APIs play a pivotal role in modern software development by enabling seamless communication and integration between various systems, applications, and services. Component-based API synthesis is a form of program synthesis that constructs an API by assembling predefined components from a library. Existing API synthesis techniques typically implement dedicated search strategies over bounded spaces of…

    Submitted 21 February, 2025; originally announced February 2025.

  45. arXiv:2502.15188  [pdf, other]

    eess.IV cs.CV

    Interleaved Block-based Learned Image Compression with Feature Enhancement and Quantization Error Compensation

    Authors: Shiqi Jiang, Hui Yuan, Shuai Li, Raouf Hamzaoui, Xu Wang, Junyan Huo

    Abstract: In recent years, learned image compression (LIC) methods have achieved significant performance improvements. However, obtaining a more compact latent representation and reducing the impact of quantization errors remain key challenges in the field of LIC. To address these challenges, we propose a feature extraction module, a feature refinement module, and a feature enhancement module. Our feature e…

    Submitted 20 February, 2025; originally announced February 2025.

  46. arXiv:2502.15174  [pdf, other]

    eess.IV cs.CV

    FD-LSCIC: Frequency Decomposition-based Learned Screen Content Image Compression

    Authors: Shiqi Jiang, Hui Yuan, Shuai Li, Huanqiang Zeng, Sam Kwong

    Abstract: Learned image compression (LIC) methods have already surpassed traditional techniques in compressing natural scene (NS) images. However, directly applying these methods to screen content (SC) images, which possess distinct characteristics such as sharp edges, repetitive patterns, and embedded text and graphics, yields suboptimal results. This paper addresses three key challenges in SC image compre…

    Submitted 20 February, 2025; originally announced February 2025.

  47. arXiv:2502.14848  [pdf, other]

    cs.CL

    GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks

    Authors: Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu

    Abstract: Large Language Models (LLMs) have shown great promise in tool-making, yet existing frameworks often struggle to efficiently construct reliable toolsets and are limited to single-task settings. To address these challenges, we propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that dynamically constructs and evolves a hierarchical graph of reusable tools across multiple scenar…

    Submitted 20 February, 2025; originally announced February 2025.

    Comments: 8 pages of main text, 38 pages of appendices

    MSC Class: 68T50 ACM Class: I.2.7

  48. arXiv:2502.14739  [pdf, other]

    cs.CL

    SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

    Authors: M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shawn Gavin, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, David Ma, Yuansheng Ni, Haoran Que, et al. (72 additional authors not shown)

    Abstract: Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-orient…

    Submitted 28 March, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

  49. arXiv:2502.13124  [pdf, other]

    cs.CL

    NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

    Authors: Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, Xian Li

    Abstract: Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse and high-quality questions. To overcome this limitation, we introduce a scalable approach for generating diverse and challenging reasoning questions, accompanied by reference answers. We present NaturalReasoning, a comprehensive dataset comprising 2.8 million questions that span mul…

    Submitted 21 February, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

    Comments: Dataset at https://huggingface.co/datasets/facebook/natural_reasoning

  50. arXiv:2502.11946  [pdf, other]

    cs.CL cs.AI cs.HC cs.SD eess.AS

    Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

    Authors: Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, et al. (120 additional authors not shown)

    Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contribu…

    Submitted 18 February, 2025; v1 submitted 17 February, 2025; originally announced February 2025.