
Showing 1–50 of 127 results for author: Tong, S

  1. arXiv:2506.09930  [pdf, ps, other]

    cs.RO cs.CV

    From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

    Authors: Irving Fang, Juexiao Zhang, Shengbang Tong, Chen Feng

    Abstract: One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instruc…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Under review

  2. arXiv:2506.07976  [pdf, ps, other]

    cs.LG cs.AI

    Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

    Authors: Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar

    Abstract: The current paradigm of test-time scaling relies on generating long reasoning traces ("thinking" more) before producing a response. In agent problems that require interaction, this can be done by generating thinking traces before acting in the world. However, this process does not allow agents to acquire new information from the environment or adapt their behavior over time. In this work, we propo…

    Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

    Comments: Fixed typo in Figure 6 and Conclusion

  3. arXiv:2506.05276  [pdf, ps, other]

    cs.LG

    How to Unlock Time Series Editing? Diffusion-Driven Approach with Multi-Grained Control

    Authors: Hao Yu, Chu Xin Cheng, Runlong Yu, Yuyang Ye, Shiwei Tong, Zhaofeng Liu, Defu Lian

    Abstract: Recent advances in time series generation have shown promise, yet controlling properties in generated sequences remains challenging. Time Series Editing (TSE), making precise modifications while preserving temporal coherence, considers both point-level constraints and segment-level controls that current methods struggle to provide. We introduce the CocktailEdit framework to enable simultaneous, f…

    Submitted 5 June, 2025; originally announced June 2025.

  4. arXiv:2506.01738  [pdf, ps, other]

    cs.CV

    STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset

    Authors: Jinhong Wang, Shuo Tong, Jian Liu, Dongqi Tang, Jintai Chen, Haochao Ying, Hongxia Xu, Danny Chen, Jian Wu

    Abstract: Visual rating is an essential capability of artificial intelligence (AI) for multi-dimensional quantification of visual content, primarily applied in ordinal regression (OR) tasks such as image quality assessment, facial age estimation, and medical image grading. However, current multi-modal large language models (MLLMs) under-perform in such visual rating ability while also suffering the lack of…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Under review at NeurIPS 2025 D&B track

  5. arXiv:2505.00592  [pdf, other]

    cs.CV cs.LG

    Uncertainty-Aware Multi-Expert Knowledge Distillation for Imbalanced Disease Grading

    Authors: Shuo Tong, Shangde Gao, Ke Liu, Zihang Huang, Hongxia Xu, Haochao Ying, Jian Wu

    Abstract: Automatic disease image grading is a significant application of artificial intelligence for healthcare, enabling faster and more accurate patient assessments. However, domain shifts, which are exacerbated by data imbalance, introduce bias into the model, posing deployment difficulties in clinical applications. To address the problem, we propose a novel Uncertainty-aware Multi-exp…

    Submitted 1 May, 2025; originally announced May 2025.

  6. arXiv:2504.15280  [pdf, other]

    cs.CV cs.CL

    Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

    Authors: Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, Yi Ma

    Abstract: Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge in Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted wi…

    Submitted 26 April, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

    Comments: Project page: https://danielchyeh.github.io/All-Angles-Bench/

  7. arXiv:2504.14891  [pdf, other]

    cs.CL

    Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey

    Authors: Aoran Gan, Hao Yu, Kai Zhang, Qi Liu, Wenyu Yan, Zhenya Huang, Shiwei Tong, Guoping Hu

    Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval, enabling accurate, up-to-date, and verifiable text generation across diverse applications. However, evaluating RAG systems presents unique challenges due to their hybrid architecture that combines retrieval and…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: 18 pages, 5 figures

  8. arXiv:2504.05778  [pdf, other]

    physics.flu-dyn physics.bio-ph physics.comp-ph

    Residual U-Net for accurate and efficient prediction of hemodynamics in two-dimensional asymmetric stenosis

    Authors: Xintong Zou, Suiyang Tong, Wenhui Peng, Qiuxiang Huang, Jianchun Wang

    Abstract: This study presents residual U-Net (U-ResNet), a deep learning surrogate model for predicting steady hemodynamic fields in two-dimensional asymmetric stenotic channels at Reynolds numbers ranging from 200 to 800. By integrating residual connections with multi-scale feature extraction, U-ResNet achieves exceptional accuracy while significantly reducing computational costs compared to computational…

    Submitted 27 May, 2025; v1 submitted 8 April, 2025; originally announced April 2025.

  9. arXiv:2504.04801  [pdf, other]

    cs.CV

    OrderChain: A General Prompting Paradigm to Improve Ordinal Understanding Ability of MLLM

    Authors: Jinhong Wang, Shuo Tong, Jian Liu, Dongqi Tang, Weiqiang Wang, Wentong Li, Hongxia Xu, Danny Chen, Jintai Chen, Jian Wu

    Abstract: Despite the remarkable progress of multimodal large language models (MLLMs), they continue to face challenges in achieving competitive performance on ordinal regression (OR; a.k.a. ordinal classification). To address this issue, this paper presents OrderChain, a novel and general prompting paradigm that improves the ordinal understanding ability of MLLMs by specificity and commonality modeling. Sp…

    Submitted 7 April, 2025; originally announced April 2025.

  10. arXiv:2504.04431  [pdf]

    physics.app-ph

    Observation of Dislocation Non-Hermitian Skin Effect

    Authors: Wenquan Wu, Qicheng Zhang, Liangjun Qi, Kun Zhang, Shuaishuai Tong, Chunyin Qiu

    Abstract: The non-Hermitian skin effect (NHSE), a striking phenomenon where a large number of states accumulate toward open boundaries, has garnered significant attention in both fundamental physics and emerging applications. Recent theoretical studies unveiled a distinctive dislocation NHSE by disentangling it from the established boundary NHSE, thereby bridging the gap between topological defects and non-…

    Submitted 6 April, 2025; originally announced April 2025.

  11. arXiv:2504.01017  [pdf, other]

    cs.CV

    Scaling Language-Free Visual Representation Learning

    Authors: David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, Saining Xie

    Abstract: Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask the question: "Do visual self-supervis…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: Project page at https://davidfan.io/webssl/

  12. arXiv:2503.23205  [pdf, ps, other]

    cs.CL cs.AI cs.DB

    Enhancing Knowledge Graph Completion with Entity Neighborhood and Relation Context

    Authors: Jianfang Chen, Kai Zhang, Aoran Gan, Shiwei Tong, Shuanghong Shen, Qi Liu

    Abstract: Knowledge Graph Completion (KGC) aims to infer missing information in Knowledge Graphs (KGs) to address their inherent incompleteness. Traditional structure-based KGC methods, while effective, face significant computational demands and scalability challenges due to the need for dense embedding learning and scoring all entities in the KG for each prediction. Recent text-based approaches using langu…

    Submitted 29 March, 2025; originally announced March 2025.

  13. arXiv:2503.13551  [pdf, other]

    cs.CL cs.AI

    Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

    Authors: Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Hailei Gong

    Abstract: Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate step. In addition, the cost of annotating reasoning processes for reward modeling is high, making large-sc…

    Submitted 6 May, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

  14. arXiv:2503.13508  [pdf]

    cs.CL cs.AI cs.CY

    It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education

    Authors: Shrutika Singh, Anton Alyakin, Daniel Alexander Alber, Jaden Stryker, Ai Phuong S Tong, Karl Sangwon, Nicolas Goff, Mathew de la Paz, Miguel Hernandez-Rovira, Ki Yun Park, Eric Claude Leuthardt, Eric Karl Oermann

    Abstract: The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQ…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: 14 pages, 5 figures

  15. arXiv:2503.11531  [pdf, other]

    cs.CY cs.AI

    Potential of large language model-powered nudges for promoting daily water and energy conservation

    Authors: Zonghan Li, Song Tong, Yi Liu, Kaiping Peng, Chunyan Wang

    Abstract: The increasing amount of pressure related to water and energy shortages has increased the urgency of cultivating individual conservation behaviors. While the concept of nudging, i.e., providing usage-based feedback, has shown promise in encouraging conservation behaviors, its efficacy is often constrained by the lack of targeted and actionable content. This study investigates the impact of the use…

    Submitted 14 March, 2025; originally announced March 2025.

  16. arXiv:2503.01152  [pdf, other]

    cs.LG cs.AI

    STGAN: Spatial-temporal Graph Autoregression Network for Pavement Distress Deterioration Prediction

    Authors: Shilin Tong, Difei Wu, Xiaona Liu, Le Zheng, Yuchuan Du, Difan Zou

    Abstract: Pavement distress significantly compromises road integrity and poses risks to drivers. Accurate prediction of pavement distress deterioration is essential for effective road management, cost reduction in maintenance, and improvement of traffic safety. However, real-world data on pavement distress is usually collected irregularly, resulting in uneven, asynchronous, and sparse spatial-temporal datas…

    Submitted 2 March, 2025; originally announced March 2025.

    Comments: 16 pages, 16 figures, 4 tables, accepted by IEEE Transactions on Intelligent Transportation Systems (TITS)

  17. arXiv:2502.04452  [pdf, other]

    astro-ph.EP

    Compact protoplanetary discs can be produced by dead zones

    Authors: Simin Tong, Richard Alexander

    Abstract: Radially compact protoplanetary discs (<=50 au) are ubiquitous in nearby star-forming regions. Multiple mechanisms have been invoked to interpret various compact discs. In this paper, we propose that fragmentation of fragile dust grains in moderate turbulence, as expected beyond the dead zone, provides an effective alternative mechanism to form compact discs which are consistent with current obser…

    Submitted 6 February, 2025; originally announced February 2025.

    Comments: 17 pages, 14+1 figures. Accepted for publication in MNRAS

  18. arXiv:2501.17161  [pdf, other]

    cs.AI cs.CV cs.LG

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Authors: Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma

    Abstract: Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic re…

    Submitted 26 May, 2025; v1 submitted 28 January, 2025; originally announced January 2025.

    Comments: Website at https://tianzhechu.com/SFTvsRL

  19. arXiv:2501.09732  [pdf, other]

    cs.CV

    Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

    Authors: Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, Saining Xie

    Abstract: Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional c…

    Submitted 16 January, 2025; originally announced January 2025.

  20. arXiv:2501.05075  [pdf, other]

    cs.AI cs.LG

    A Text-Based Knowledge-Embedded Soft Sensing Modeling Approach for General Industrial Process Tasks Based on Large Language Model

    Authors: Shuo Tong, Han Liu, Runyuan Guo, Xueqiong Tian, Wenqing Wang, Ding Liu, Youmin Zhang

    Abstract: Data-driven soft sensors (DDSS) have become mainstream methods for predicting key performance indicators in process industries. However, DDSS development requires complex and costly customized designs tailored to various tasks during the modeling process. Moreover, DDSS are constrained to a single structured data modality, limiting their ability to incorporate additional contextual knowledge. Furt…

    Submitted 9 January, 2025; originally announced January 2025.

  21. arXiv:2501.03295  [pdf]

    cs.LG cs.AI eess.SP

    A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation Based on Large Language Models Enhanced by Domain Knowledge Retrieval

    Authors: Shuo Tong, Han Liu, Runyuan Guo, Wenqing Wang, Xueqiong Tian, Lingyun Wei, Lin Zhang, Huayong Wu, Ding Liu, Youmin Zhang

    Abstract: Data-driven soft sensors are crucial in predicting key performance indicators in industrial systems. However, current methods predominantly rely on the supervised learning paradigms of parameter updating, which inherently face challenges such as high development costs, poor robustness, training instability, and lack of interpretability. Recently, large language models (LLMs) have demonstrated sig…

    Submitted 7 January, 2025; v1 submitted 6 January, 2025; originally announced January 2025.

  22. arXiv:2412.14164  [pdf, other]

    cs.CV

    MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    Authors: Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu

    Abstract: In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data cura…

    Submitted 18 December, 2024; originally announced December 2024.

    Comments: Project page at tsb0601.github.io/metamorph

  23. arXiv:2412.01711  [pdf, other]

    cs.CL

    Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

    Authors: Schrasing Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal

    Abstract: Although large language models (LLMs) have demonstrated their effectiveness in a wide range of applications, they have also been observed to perpetuate unwanted biases present in the training data, potentially leading to harm for marginalized communities. In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal that will be added to the…

    Submitted 2 December, 2024; originally announced December 2024.

    Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Safe Generative AI Workshop

  24. arXiv:2411.16039  [pdf]

    physics.med-ph

    Label-Free Intraoperative Mean-Transition-Time Image Generation Using Statistical Gating and Deep Learning

    Authors: Yan Shi, Denghui Zhao, Jingyi Yu, Wei Ni, Pengcheng Li, Yun Gu, Peng Miao, Shanbao Tong

    Abstract: It is of paramount importance to visualize blood dynamics intraoperatively, as this enables the accurate diagnosis of intraoperative conditions and facilitates informed surgical decision-making. Indocyanine green (ICG) fluorescence imaging represents the gold standard for the assessment of blood flow and the identification of vascular structures. However, it has several disadvantages, including ti…

    Submitted 24 November, 2024; originally announced November 2024.

  25. arXiv:2410.19560  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV eess.SP

    Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

    Authors: Shentong Mo, Shengbang Tong

    Abstract: In recent advancements in unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations have been identified: the inefficacy of Exponential Moving Average (EMA) from I-JEPA in preventing enti…

    Submitted 25 October, 2024; originally announced October 2024.

  26. arXiv:2410.13669  [pdf]

    q-bio.NC

    Theta and/or alpha? Neural oscillational substrates for dynamic inter-brain synchrony during mother-child cooperation

    Authors: Jiayang Xu, Yamin Li, Ruxin Su, Saishuang Wu, Chengcheng Wu, Haiwa Wang, Qi Zhu, Yue Fang, Fan Jiang, Shanbao Tong, Yunting Zhang, Xiaoli Guo

    Abstract: Mother-child interaction is a highly dynamic process neurally characterized by inter-brain synchrony (IBS) at θ and/or α rhythms. However, their establishment, dynamic changes, and roles in mother-child interactions remain unknown. Through dynamic analysis of dual-EEG from 40 mother-child dyads during turn-taking cooperation, we uncover that θ-IBS and α-IBS alternated with interactive behaviors, w…

    Submitted 30 October, 2024; v1 submitted 17 October, 2024; originally announced October 2024.

    Comments: 27 pages, 6 figures

  27. arXiv:2410.00269  [pdf, other]

    hep-ex hep-ph

    Learning to Reconstruct Quirky Tracks

    Authors: Qiyu Sha, Daniel Murnane, Max Fieg, Shelley Tong, Mark Zakharyan, Yaquan Fang, Daniel Whiteson

    Abstract: Analysis of data from particle physics experiments traditionally sacrifices some sensitivity to new particles for the sake of practical computability, effectively ignoring some potentially striking signatures. However, recent advances in ML-based tracking allow for new inroads into previously inaccessible territory, such as reconstruction of tracks which do not follow helical trajectories. This pa…

    Submitted 30 September, 2024; originally announced October 2024.

  28. arXiv:2409.06184  [pdf, other]

    math.OC cs.GT math.AP math.NA

    A Policy Iteration Method for Inverse Mean Field Games

    Authors: Kui Ren, Nathan Soedjak, Shanyin Tong

    Abstract: We propose a policy iteration method to solve an inverse problem for a mean-field game (MFG) model, specifically to reconstruct the obstacle function in the game from the partial observation data of value functions, which represent the optimal costs for agents. The proposed approach decouples this complex inverse problem, which is an optimization problem constrained by a coupled nonlinear forward…

    Submitted 15 April, 2025; v1 submitted 9 September, 2024; originally announced September 2024.

    MSC Class: 35Q89; 35R30; 49L12; 49M41; 49N45; 49N80; 65K10; 91A16

  29. arXiv:2409.02813  [pdf, other]

    cs.CL cs.CV

    MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    Authors: Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig

    Abstract: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-o…

    Submitted 22 May, 2025; v1 submitted 4 September, 2024; originally announced September 2024.

    Comments: ACL 2025 Main

  30. arXiv:2408.03496  [pdf, other]

    math.NA math.OC physics.comp-ph physics.med-ph

    A three-stage method for reconstructing multiple coefficients in coupled photoacoustic and diffuse optical imaging

    Authors: Yinxi Pan, Kui Ren, Shanyin Tong

    Abstract: This paper studies inverse problems in quantitative photoacoustic tomography with additional optical current data supplemented from diffuse optical tomography. We propose a three-stage image reconstruction method for the simultaneous recovery of the absorption, diffusion, and Grüneisen coefficients. We demonstrate, through numerical simulations, that: (i) when the Grüneisen coefficient is known, t…

    Submitted 23 January, 2025; v1 submitted 6 August, 2024; originally announced August 2024.

    MSC Class: 35J47; 35R30; 49M15; 65M32; 78A46; 78A60; 78A70; 80A23; 92C55; 94A08

  31. arXiv:2407.12209  [pdf, other]

    astro-ph.EP astro-ph.SR

    A question of personalities: evolution of viscous and wind-driven protoplanetary discs in the presence of dead zones

    Authors: Simin Tong, Richard Alexander, Giovanni Rosotti

    Abstract: Whether the angular momentum of protoplanetary discs is redistributed by viscosity or extracted by magnetised winds is a long-standing question. Demographic indicators, such as gas disc sizes and stellar accretion rates, have been proposed as ways of distinguishing between these two mechanisms. In this paper, we implement one-dimensional gas simulations to study the evolution of "hybrid" protoplan…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: 23 pages, 17 figures. Accepted for publication in MNRAS

  32. arXiv:2406.16860  [pdf, other]

    cs.CV

    Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    Authors: Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, Saining Xie

    Abstract: We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and…

    Submitted 4 December, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: NeurIPS 2024 (Oral). Website at https://cambrian-mllm.github.io

  33. arXiv:2406.01276  [pdf, other]

    cs.CL

    EduNLP: Towards a Unified and Modularized Library for Educational Resources

    Authors: Zhenya Huang, Yuting Ning, Longhu Qin, Shiwei Tong, Shangzi Xue, Tong Xiao, Xin Lin, Jiayu Liu, Qi Liu, Enhong Chen, Shijing Wang

    Abstract: Educational resource understanding is vital to online learning platforms, which have seen growing adoption recently. However, researchers and developers often struggle with using existing general natural language toolkits or domain-specific models. The issue raises a need to develop an effective and easy-to-use one that benefits AI education-related research and applications. To bridg…

    Submitted 4 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

  34. arXiv:2405.10292  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

    Authors: Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine

    Abstract: Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic…

    Submitted 7 October, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

  35. Evaluation of Retrieval-Augmented Generation: A Survey

    Authors: Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, Zhaofeng Liu

    Abstract: Retrieval-Augmented Generation (RAG) has recently gained traction in natural language processing. Numerous studies and real-world applications are leveraging its ability to enhance generative models through external information retrieval. Evaluating these RAG systems, however, poses unique challenges due to their hybrid structure and reliance on dynamic knowledge sources. To better understand thes…

    Submitted 3 July, 2024; v1 submitted 12 May, 2024; originally announced May 2024.

  36. arXiv:2404.13918  [pdf]

    eess.SP

    Emerging Advancements in 6G NTN Radio Access Technologies: An Overview

    Authors: Husnain Shahid, Carla Amatetti, Riccardo Campana, Sorya Tong, Dorin Panaitopol, Alessandro Vanelli Coralli, Abdelhamed Mohamed, Chao Zhang, Ebraam Khalifa, Eduardo Medeiros, Estefania Recayte, Fatemeh Ghasemifard, Ji Lianghai, Juan Bucheli, Karthik Anantha Swamy, Marius Caus, Mehmet Gurelli, Miguel A. Vazquez, Musbah Shaat, Nathan Borios, Per-Erik Eriksson, Sebastian Euler, Zheng Li, Xiaotian Fu

    Abstract: The efforts on the development, standardization and improvements to communication systems towards 5G Advanced and 6G are on track to provide benefits such as an unprecedented level of connectivity and performance, enabling a diverse range of vertical services. The full integration of non-terrestrial components into 6G plays a pivotal role in realizing this paradigm shift towards ubiquitous communi…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: Accepted at 2024 EuCNC and 6G Summit, Antwerp, Belgium, 3-6 June 2024

  37. arXiv:2404.09567  [pdf, other]

    eess.SY

    A competitive game optimization algorithm for Unmanned Aerial Vehicle path planning

    Authors: Tai-shan Lou, Guang-sheng Guan, Zhe-peng Yue, Yu Wang, Ren-long Qi, Shi-hao Tong

    Abstract: To solve the Unmanned Aerial Vehicle (UAV) path planning problem, a meta-heuristic optimization algorithm called competitive game optimizer (CGO) is proposed. In the CGO model, three phases of exploration and exploitation, and candidate replacement, are established, corresponding to the player's search for supplies and combat, and the movement toward a safe zone. In the algorithm exploration phase…

    Submitted 15 April, 2024; originally announced April 2024.

  38. arXiv:2404.05575  [pdf]

    cond-mat.mtrl-sci

    Prediction of topotactic transition from black to blue phosphorus induced by surface Br adsorption

    Authors: Hao Tian, Wenjun Xie, Maohai Xie, Chuanhui Zhu, Hu Xu, Shuk-Yin Tong

    Abstract: Based on first-principles calculations, we propose a potential access to the yet unrealized freestanding blue phosphorus (blueP) through transformation of black phosphorus (blackP) induced by surface bromine (Br) adsorption. Formation of the Br-P bonds disrupts the original sp3 configurations in blackP, generates unpaired pz electrons and induces a structural transformation that results in blueP f…

    Submitted 8 April, 2024; originally announced April 2024.

  39. arXiv:2403.10953  [pdf, other]

    cs.CV

    Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription

    Authors: Hongxiang Zhao, Xili Dai, Jianan Wang, Shengbang Tong, Jingyuan Zhang, Weida Wang, Lei Zhang, Yi Ma

    Abstract: Large image diffusion models have demonstrated zero-shot capability in novel view synthesis (NVS). However, existing diffusion-based NVS methods struggle to generate novel views that are accurately consistent with the corresponding ground truth poses and appearances, even on the training set. This consequently limits the performance of downstream tasks, such as image-to-multiview generation and 3D…

    Submitted 21 June, 2024; v1 submitted 16 March, 2024; originally announced March 2024.

  40. Automating psychological hypothesis generation with AI: when large language models meet causal graph

    Authors: Song Tong, Kai Mao, Zhen Huang, Yukun Zhao, Kaiping Peng

    Abstract: Leveraging the synergy between causal knowledge graphs and a large language model (LLM), our study introduces a groundbreaking approach for computational hypothesis generation in psychology. We analyzed 43,312 psychology articles using an LLM to extract causal relation pairs. This analysis produced a specialized causal graph for psychology. Applying link prediction algorithms, we generated 130 pote…

    Submitted 15 July, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Journal ref: Humanities and Social Sciences Communications, (2024) 11:896

  41. arXiv:2402.02065  [pdf, other]

    cs.LG

    Training Implicit Networks for Image Deblurring using Jacobian-Free Backpropagation

    Authors: Linghai Liu, Shuaicheng Tong, Lisa Zhao

    Abstract: Recent efforts in applying implicit networks to solve inverse problems in imaging have achieved competitive or even superior results when compared to feedforward networks. These implicit networks only require constant memory during backpropagation, regardless of the number of layers. However, they are not necessarily easy to train. Gradient calculations are computationally expensive because they r…

    Submitted 3 February, 2024; originally announced February 2024.

  42. arXiv:2401.06209  [pdf, other]

    cs.CV

    Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

    Authors: Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie

    Abstract: Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic short…

    Submitted 25 April, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

    Comments: Project page: https://tsb0601.github.io/mmvp_blog/

  43. arXiv:2401.01519  [pdf]

    cs.LG cs.AI

    Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review

    Authors: Luoma Ke, Song Tong, Peng Cheng, Kaiping Peng

    Abstract: This paper explores the frontiers of large language models (LLMs) in psychology applications. Psychology has undergone several theoretical changes, and the current use of Artificial Intelligence (AI) and Machine Learning, particularly LLMs, promises to open up new research directions. We provide a detailed exploration of how LLMs like ChatGPT are transforming psychological research. It discusses t…

    Submitted 20 April, 2025; v1 submitted 2 January, 2024; originally announced January 2024.

  44. arXiv:2312.11490  [pdf]

    cond-mat.other

    Tracking Intrinsic Non-Hermitian Skin Effect in Lossy Lattices

    Authors: Liwei Xiong, Qicheng Zhang, Xiling Feng, Yufei Leng, Min Pi, Shuaishuai Tong, Chunyin Qiu

    Abstract: Non-Hermitian skin effect (NHSE), characterized by a majority of eigenstates localized at open boundaries, is one of the most iconic phenomena in non-Hermitian lattices. Despite notable experimental studies implemented, most of them witness only certain signs of the NHSE rather than the intrinsic exponential localization inherent in eigenstates, owing to the ubiquitous and inevitable background lo…

    Submitted 1 December, 2023; originally announced December 2023.

  45. arXiv:2311.13110  [pdf, other]

    cs.LG cs.CL cs.CV

    White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

    Authors: Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D. Haeffele, Yi Ma

    Abstract: In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information…

    Submitted 6 September, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

    Comments: Accepted at Journal of Machine Learning Research. This paper integrates the works arXiv:2306.01129 and arXiv:2308.16271 into a complete story. In this paper, we improve the writing and organization, and also add conceptual, empirical, and theoretical improvements over the previous work. V2: small typo fixes/formatting improvements. V3: improvements from journal revisions. V4: fix figures

  46. arXiv:2311.00121  [pdf, other]

    hep-ph

    New Physics in Single Resonant Top Quarks

    Authors: Shelley Tong, James Corcoran, Max Fieg, Michael Fenton, Daniel Whiteson

    Abstract: Searches for new physics in the top quark sector are of great theoretical interest, yet some powerful avenues for discovery remain unexplored. We characterize the expected statistical power of the LHC dataset to constrain the single production of heavy top partners $T$ decaying to a top quark and a photon or a top quark and a gluon. We describe an effective interaction which could generate such pr…

    Submitted 31 October, 2023; originally announced November 2023.

  47. Sensitivity Analysis of the Information Gain in Infinite-Dimensional Bayesian Linear Inverse Problems

    Authors: Abhijit Chowdhary, Shanyin Tong, Georg Stadler, Alen Alexanderian

    Abstract: We study the sensitivity of infinite-dimensional Bayesian linear inverse problems governed by partial differential equations (PDEs) with respect to modeling uncertainties. In particular, we consider derivative-based sensitivity analysis of the information gain, as measured by the Kullback-Leibler divergence from the posterior to the prior distribution. To facilitate this, we develop a fast and acc…

    Submitted 16 May, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

    Comments: 20 pages, 7 figures

    MSC Class: 65C60; 90C31; 62F15; 35R30; 65F55

  48. arXiv:2309.16681  [pdf, other]

    cs.IT cs.AI

    Alternate Learning based Sparse Semantic Communications for Visual Transmission

    Authors: Siyu Tong, Xiaoxue Yu, Rongpeng Li, Kun Lu, Zhifeng Zhao, Honggang Zhang

    Abstract: Semantic communication (SemCom) demonstrates strong superiority over conventional bit-level accurate transmission, by only attempting to recover the essential semantic information of data. In this paper, in order to tackle the non-differentiability of channels, we propose an alternate learning based SemCom system for visual transmission, named SparseSBC. Specifically, SparseSBC leverages two separate…

    Submitted 30 July, 2023; originally announced September 2023.

  49. arXiv:2309.10313  [pdf, other]

    cs.CL cs.AI cs.LG

    Investigating the Catastrophic Forgetting in Multimodal Large Language Models

    Authors: Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma

    Abstract: Following the success of GPT4, there has been a surge in interest in multimodal large language model (MLLM) research. This line of research focuses on developing general-purpose LLMs through fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon where the fine-tuned model fails to retain similar performance compared to the pre-trained model, still…

    Submitted 5 December, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

  50. arXiv:2309.09449  [pdf]

    cs.DL

    Multi-Affiliated Authors Behave Differently across Fields and Host Country Preferences: A Comparison in G7 and BRICS

    Authors: Sichao Tong, Liying Yang

    Abstract: This paper studies authors simultaneously engaged in multiple affiliations based on bibliometric data covered in the Web of Science for the 2017-2021 period. Based on the affiliation information in publication records, we propose a general classification for multiple affiliations within-country or cross-country for analyzing authors' behavior in multiple affiliations and preferences of host countries…

    Submitted 17 September, 2023; originally announced September 2023.