Showing 1–50 of 79 results for author: Tong, S

  1. arXiv:2507.05411  [pdf, ps, other]

    cs.LG

    AXLearn: Modular Large Model Training on Heterogeneous Infrastructure

    Authors: Mark Lee, Tom Gunter, Chang Lan, John Peebles, Hanzhi Zhou, Kelvin Zou, Sneha Bangalore, Chung-Cheng Chiu, Nan Du, Xianzhi Du, Philipp Dufter, Ruixuan Hou, Haoshuo Huang, Dongseong Hwang, Xiang Kong, Jinhao Lei, Tao Lei, Meng Li, Li Li, Jiarui Lu, Zhiyun Lu, Yiping Ma, David Qiu, Vivek Rathod, Senyu Tong, et al. (12 additional authors not shown)

    Abstract: We design and implement AXLearn, a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure. AXLearn's internal interfaces between software components follow strict encapsulation, allow…

    Submitted 7 July, 2025; originally announced July 2025.

  2. arXiv:2506.14968  [pdf, ps, other]

    cs.RO cs.AI

    FEAST: A Flexible Mealtime-Assistance System Towards In-the-Wild Personalization

    Authors: Rajat Kumar Jenamani, Tom Silver, Ben Dodson, Shiqin Tong, Anthony Song, Yuting Yang, Ziang Liu, Benjamin Howe, Aimee Whitneck, Tapomayukh Bhattacharjee

    Abstract: Physical caregiving robots hold promise for improving the quality of life of millions worldwide who require assistance with feeding. However, in-home meal assistance remains challenging due to the diversity of activities (e.g., eating, drinking, mouth wiping), contexts (e.g., socializing, watching TV), food items, and user preferences that arise during deployment. In this work, we propose FEAST, a…

    Submitted 27 June, 2025; v1 submitted 17 June, 2025; originally announced June 2025.

    Comments: RSS 2025 - Best Paper Award

  3. arXiv:2506.09930  [pdf, ps, other]

    cs.RO cs.CV

    From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

    Authors: Irving Fang, Juexiao Zhang, Shengbang Tong, Chen Feng

    Abstract: One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instruc…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Under review

  4. arXiv:2506.07976  [pdf, ps, other]

    cs.LG cs.AI

    Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

    Authors: Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar

    Abstract: The current paradigm of test-time scaling relies on generating long reasoning traces ("thinking" more) before producing a response. In agent problems that require interaction, this can be done by generating thinking traces before acting in the world. However, this process does not allow agents to acquire new information from the environment or adapt their behavior over time. In this work, we propo…

    Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

    Comments: Fixed typo in Figure 6 and Conclusion

  5. arXiv:2506.05276  [pdf, ps, other]

    cs.LG

    How to Unlock Time Series Editing? Diffusion-Driven Approach with Multi-Grained Control

    Authors: Hao Yu, Chu Xin Cheng, Runlong Yu, Yuyang Ye, Shiwei Tong, Zhaofeng Liu, Defu Lian

    Abstract: Recent advances in time series generation have shown promise, yet controlling properties in generated sequences remains challenging. Time Series Editing (TSE) - making precise modifications while preserving temporal coherence - considers both point-level constraints and segment-level controls that current methods struggle to provide. We introduce the CocktailEdit framework to enable simultaneous, f…

    Submitted 5 June, 2025; originally announced June 2025.

  6. arXiv:2506.01738  [pdf, ps, other]

    cs.CV

    STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset

    Authors: Jinhong Wang, Shuo Tong, Jian liu, Dongqi Tang, Jintai Chen, Haochao Ying, Hongxia Xu, Danny Chen, Jian Wu

    Abstract: Visual rating is an essential capability of artificial intelligence (AI) for multi-dimensional quantification of visual content, primarily applied in ordinal regression (OR) tasks such as image quality assessment, facial age estimation, and medical image grading. However, current multi-modal large language models (MLLMs) under-perform in such visual rating ability while also suffering the lack of…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Under review at NeurIPS 2025 D&B track

  7. arXiv:2505.00592  [pdf, other]

    cs.CV cs.LG

    Uncertainty-Aware Multi-Expert Knowledge Distillation for Imbalanced Disease Grading

    Authors: Shuo Tong, Shangde Gao, Ke Liu, Zihang Huang, Hongxia Xu, Haochao Ying, Jian Wu

    Abstract: Automatic disease image grading is a significant application of artificial intelligence for healthcare, enabling faster and more accurate patient assessments. However, domain shifts, which are exacerbated by data imbalance, introduce bias into the model, posing deployment difficulties in clinical applications. To address the problem, we propose a novel Uncertainty-aware Multi-exp…

    Submitted 1 May, 2025; originally announced May 2025.

  8. arXiv:2504.15280  [pdf, other]

    cs.CV cs.CL

    Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

    Authors: Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, Yi Ma

    Abstract: Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge in Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted wi…

    Submitted 26 April, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

    Comments: Project page: https://danielchyeh.github.io/All-Angles-Bench/

  9. arXiv:2504.14891  [pdf, other]

    cs.CL

    Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey

    Authors: Aoran Gan, Hao Yu, Kai Zhang, Qi Liu, Wenyu Yan, Zhenya Huang, Shiwei Tong, Guoping Hu

    Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval, enabling accurate, up-to-date, and verifiable text generation across diverse applications. However, evaluating RAG systems presents unique challenges due to their hybrid architecture that combines retrieval and…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: 18 pages, 5 figures

  10. arXiv:2504.04801  [pdf, ps, other]

    cs.CV

    OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM

    Authors: Jinhong Wang, Shuo Tong, Jian liu, Dongqi Tang, Weiqiang Wang, Wentong Li, Hongxia Xu, Danny Chen, Jintai Chen, Jian Wu

    Abstract: Despite the remarkable progress of multimodal large language models (MLLMs), they continue to face challenges in achieving competitive performance on ordinal regression (OR; a.k.a. ordinal classification). To address this issue, this paper presents OrderChain, a novel and general prompting paradigm that improves the ordinal understanding ability of MLLMs by specificity and commonality modeling. Sp…

    Submitted 30 June, 2025; v1 submitted 7 April, 2025; originally announced April 2025.

  11. arXiv:2504.01017  [pdf, other]

    cs.CV

    Scaling Language-Free Visual Representation Learning

    Authors: David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, Saining Xie

    Abstract: Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask the question: "Do visual self-supervis…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: Project page at https://davidfan.io/webssl/

  12. arXiv:2503.23205  [pdf, ps, other]

    cs.CL cs.AI cs.DB

    Enhancing Knowledge Graph Completion with Entity Neighborhood and Relation Context

    Authors: Jianfang Chen, Kai Zhang, Aoran Gan, Shiwei Tong, Shuanghong Shen, Qi Liu

    Abstract: Knowledge Graph Completion (KGC) aims to infer missing information in Knowledge Graphs (KGs) to address their inherent incompleteness. Traditional structure-based KGC methods, while effective, face significant computational demands and scalability challenges due to the need for dense embedding learning and scoring all entities in the KG for each prediction. Recent text-based approaches using langu…

    Submitted 29 March, 2025; originally announced March 2025.

  13. arXiv:2503.13551  [pdf, other]

    cs.CL cs.AI

    Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

    Authors: Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Hailei Gong

    Abstract: Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate step. In addition, the cost of annotating reasoning processes for reward modeling is high, making large-sc…

    Submitted 6 May, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

  14. arXiv:2503.13508  [pdf]

    cs.CL cs.AI cs.CY

    It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education

    Authors: Shrutika Singh, Anton Alyakin, Daniel Alexander Alber, Jaden Stryker, Ai Phuong S Tong, Karl Sangwon, Nicolas Goff, Mathew de la Paz, Miguel Hernandez-Rovira, Ki Yun Park, Eric Claude Leuthardt, Eric Karl Oermann

    Abstract: The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQ…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: 14 pages, 5 figures

  15. arXiv:2503.11531  [pdf, other]

    cs.CY cs.AI

    Potential of large language model-powered nudges for promoting daily water and energy conservation

    Authors: Zonghan Li, Song Tong, Yi Liu, Kaiping Peng, Chunyan Wang

    Abstract: The increasing amount of pressure related to water and energy shortages has increased the urgency of cultivating individual conservation behaviors. While the concept of nudging, i.e., providing usage-based feedback, has shown promise in encouraging conservation behaviors, its efficacy is often constrained by the lack of targeted and actionable content. This study investigates the impact of the use…

    Submitted 14 March, 2025; originally announced March 2025.

  16. arXiv:2503.01152  [pdf, other]

    cs.LG cs.AI

    STGAN: Spatial-temporal Graph Autoregression Network for Pavement Distress Deterioration Prediction

    Authors: Shilin Tong, Difei Wu, Xiaona Liu, Le Zheng, Yuchuan Du, Difan Zou

    Abstract: Pavement distress significantly compromises road integrity and poses risks to drivers. Accurate prediction of pavement distress deterioration is essential for effective road management, cost reduction in maintenance, and improvement of traffic safety. However, real-world data on pavement distress is usually collected irregularly, resulting in uneven, asynchronous, and sparse spatial-temporal datas…

    Submitted 2 March, 2025; originally announced March 2025.

    Comments: 16 pages, 16 figures, 4 tables, accepted by IEEE Transactions on Intelligent Transportation Systems (TITS)

  17. arXiv:2501.17161  [pdf, other]

    cs.AI cs.CV cs.LG

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Authors: Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma

    Abstract: Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic re…

    Submitted 26 May, 2025; v1 submitted 28 January, 2025; originally announced January 2025.

    Comments: Website at https://tianzhechu.com/SFTvsRL

  18. arXiv:2501.09732  [pdf, other]

    cs.CV

    Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

    Authors: Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, Saining Xie

    Abstract: Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional c…

    Submitted 16 January, 2025; originally announced January 2025.

  19. arXiv:2501.05075  [pdf, other]

    cs.AI cs.LG

    A Text-Based Knowledge-Embedded Soft Sensing Modeling Approach for General Industrial Process Tasks Based on Large Language Model

    Authors: Shuo Tong, Han Liu, Runyuan Guo, Xueqiong Tian, Wenqing Wang, Ding Liu, Youmin Zhang

    Abstract: Data-driven soft sensors (DDSS) have become mainstream methods for predicting key performance indicators in process industries. However, DDSS development requires complex and costly customized designs tailored to various tasks during the modeling process. Moreover, DDSS are constrained to a single structured data modality, limiting their ability to incorporate additional contextual knowledge. Furt…

    Submitted 9 January, 2025; originally announced January 2025.

  20. arXiv:2501.03295  [pdf]

    cs.LG cs.AI eess.SP

    A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation Based on Large Language Models Enhanced by Domain Knowledge Retrieval

    Authors: Shuo Tong, Han Liu, Runyuan Guo, Wenqing Wang, Xueqiong Tian, Lingyun Wei, Lin Zhang, Huayong Wu, Ding Liu, Youmin Zhang

    Abstract: Data-driven soft sensors are crucial in predicting key performance indicators in industrial systems. However, current methods predominantly rely on the supervised learning paradigms of parameter updating, which inherently faces challenges such as high development costs, poor robustness, training instability, and lack of interpretability. Recently, large language models (LLMs) have demonstrated sig…

    Submitted 7 January, 2025; v1 submitted 6 January, 2025; originally announced January 2025.

  21. arXiv:2412.14164  [pdf, other]

    cs.CV

    MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    Authors: Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu

    Abstract: In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data cura…

    Submitted 18 December, 2024; originally announced December 2024.

    Comments: Project page at tsb0601.github.io/metamorph

  22. arXiv:2412.01711  [pdf, other]

    cs.CL

    Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

    Authors: Schrasing Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal

    Abstract: Although large language models (LLMs) have demonstrated their effectiveness in a wide range of applications, they have also been observed to perpetuate unwanted biases present in the training data, potentially leading to harm for marginalized communities. In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal that will be added to the…

    Submitted 2 December, 2024; originally announced December 2024.

    Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Safe Generative AI Workshop

  23. arXiv:2410.19560  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV eess.SP

    Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

    Authors: Shentong Mo, Shengbang Tong

    Abstract: In recent advancements in unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations have been identified: the inefficacy of Exponential Moving Average (EMA) from I-JEPA in preventing enti…

    Submitted 25 October, 2024; originally announced October 2024.

  24. arXiv:2409.06184  [pdf, other]

    math.OC cs.GT math.AP math.NA

    A Policy Iteration Method for Inverse Mean Field Games

    Authors: Kui Ren, Nathan Soedjak, Shanyin Tong

    Abstract: We propose a policy iteration method to solve an inverse problem for a mean-field game (MFG) model, specifically to reconstruct the obstacle function in the game from the partial observation data of value functions, which represent the optimal costs for agents. The proposed approach decouples this complex inverse problem, which is an optimization problem constrained by a coupled nonlinear forward…

    Submitted 15 April, 2025; v1 submitted 9 September, 2024; originally announced September 2024.

    MSC Class: 35Q89; 35R30; 49L12; 49M41; 49N45; 49N80; 65K10; 91A16

  25. arXiv:2409.02813  [pdf, other]

    cs.CL cs.CV

    MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    Authors: Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig

    Abstract: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-o…

    Submitted 22 May, 2025; v1 submitted 4 September, 2024; originally announced September 2024.

    Comments: ACL 2025 Main

  26. arXiv:2406.16860  [pdf, other]

    cs.CV

    Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    Authors: Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, Saining Xie

    Abstract: We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and…

    Submitted 4 December, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: NeurIPS 2024 (Oral). Website at https://cambrian-mllm.github.io

  27. arXiv:2406.01276  [pdf, other]

    cs.CL

    EduNLP: Towards a Unified and Modularized Library for Educational Resources

    Authors: Zhenya Huang, Yuting Ning, Longhu Qin, Shiwei Tong, Shangzi Xue, Tong Xiao, Xin Lin, Jiayu Liu, Qi Liu, Enhong Chen, Shijing Wang

    Abstract: Educational resource understanding is vital to online learning platforms, which have demonstrated growing applications recently. However, researchers and developers always struggle with using existing general natural language toolkits or domain-specific models. The issue raises a need to develop an effective and easy-to-use one that benefits AI education-related research and applications. To bridg…

    Submitted 4 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

  28. arXiv:2405.10292  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

    Authors: Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine

    Abstract: Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic…

    Submitted 7 October, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

  29. Evaluation of Retrieval-Augmented Generation: A Survey

    Authors: Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, Zhaofeng Liu

    Abstract: Retrieval-Augmented Generation (RAG) has recently gained traction in natural language processing. Numerous studies and real-world applications are leveraging its ability to enhance generative models through external information retrieval. Evaluating these RAG systems, however, poses unique challenges due to their hybrid structure and reliance on dynamic knowledge sources. To better understand thes…

    Submitted 3 July, 2024; v1 submitted 12 May, 2024; originally announced May 2024.

  30. arXiv:2403.10953  [pdf, other]

    cs.CV

    Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription

    Authors: Hongxiang Zhao, Xili Dai, Jianan Wang, Shengbang Tong, Jingyuan Zhang, Weida Wang, Lei Zhang, Yi Ma

    Abstract: Large image diffusion models have demonstrated zero-shot capability in novel view synthesis (NVS). However, existing diffusion-based NVS methods struggle to generate novel views that are accurately consistent with the corresponding ground truth poses and appearances, even on the training set. This consequently limits the performance of downstream tasks, such as image-to-multiview generation and 3D…

    Submitted 21 June, 2024; v1 submitted 16 March, 2024; originally announced March 2024.

  31. Automating psychological hypothesis generation with AI: when large language models meet causal graph

    Authors: Song Tong, Kai Mao, Zhen Huang, Yukun Zhao, Kaiping Peng

    Abstract: Leveraging the synergy between causal knowledge graphs and a large language model (LLM), our study introduces a groundbreaking approach for computational hypothesis generation in psychology. We analyzed 43,312 psychology articles using an LLM to extract causal relation pairs. This analysis produced a specialized causal graph for psychology. Applying link prediction algorithms, we generated 130 pote…

    Submitted 15 July, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Journal ref: Humanities and Social Sciences Communications, (2024) 11:896

  32. arXiv:2402.02065  [pdf, other]

    cs.LG

    Training Implicit Networks for Image Deblurring using Jacobian-Free Backpropagation

    Authors: Linghai Liu, Shuaicheng Tong, Lisa Zhao

    Abstract: Recent efforts in applying implicit networks to solve inverse problems in imaging have achieved competitive or even superior results when compared to feedforward networks. These implicit networks only require constant memory during backpropagation, regardless of the number of layers. However, they are not necessarily easy to train. Gradient calculations are computationally expensive because they r…

    Submitted 3 February, 2024; originally announced February 2024.

  33. arXiv:2401.06209  [pdf, other]

    cs.CV

    Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

    Authors: Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie

    Abstract: Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic short…

    Submitted 25 April, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

    Comments: Project page: https://tsb0601.github.io/mmvp_blog/

  34. arXiv:2401.01519  [pdf]

    cs.LG cs.AI

    Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review

    Authors: Luoma Ke, Song Tong, Peng Cheng, Kaiping Peng

    Abstract: This paper explores the frontiers of large language models (LLMs) in psychology applications. Psychology has undergone several theoretical changes, and the current use of Artificial Intelligence (AI) and Machine Learning, particularly LLMs, promises to open up new research directions. We provide a detailed exploration of how LLMs like ChatGPT are transforming psychological research. It discusses t…

    Submitted 20 April, 2025; v1 submitted 2 January, 2024; originally announced January 2024.

  35. arXiv:2311.13110  [pdf, other]

    cs.LG cs.CL cs.CV

    White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

    Authors: Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D. Haeffele, Yi Ma

    Abstract: In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information…

    Submitted 6 September, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

    Comments: Accepted at Journal of Machine Learning Research. This paper integrates the works arXiv:2306.01129 and arXiv:2308.16271 into a complete story. In this paper, we improve the writing and organization, and also add conceptual, empirical, and theoretical improvements over the previous work. V2: small typo fixes/formatting improvements. V3: improvements from journal revisions. V4: fix figures

  36. arXiv:2309.16681  [pdf, other]

    cs.IT cs.AI

    Alternate Learning based Sparse Semantic Communications for Visual Transmission

    Authors: Siyu Tong, Xiaoxue Yu, Rongpeng Li, Kun Lu, Zhifeng Zhao, Honggang Zhang

    Abstract: Semantic communication (SemCom) demonstrates strong superiority over conventional bit-level accurate transmission, by only attempting to recover the essential semantic information of data. In this paper, in order to tackle the non-differentiability of channels, we propose an alternate learning based SemCom system for visual transmission, named SparseSBC. Specifically, SparseSBC leverages two separate…

    Submitted 30 July, 2023; originally announced September 2023.

  37. arXiv:2309.10313  [pdf, other]

    cs.CL cs.AI cs.LG

    Investigating the Catastrophic Forgetting in Multimodal Large Language Models

    Authors: Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma

    Abstract: Following the success of GPT4, there has been a surge in interest in multimodal large language model (MLLM) research. This line of research focuses on developing general-purpose LLMs through fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon where the fine-tuned model fails to retain similar performance compared to the pre-trained model, still…

    Submitted 5 December, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

  38. arXiv:2309.09449  [pdf]

    cs.DL

    Multi-Affiliated Authors Behave Differently across Fields and Host Country Preferences: A Comparison in G7 and BRICS

    Authors: Sichao Tong, Liying Yang

    Abstract: This paper studies authors simultaneously engaged in multiple affiliations based on bibliometric data covered in the Web of Science for the 2017-2021 period. Based on the affiliation information in publication records, we propose a general classification for multiple affiliations within-country or cross-country for analyzing authors' behavior in multiple affiliations and preferences of host countries…

    Submitted 17 September, 2023; originally announced September 2023.

  39. arXiv:2308.16271  [pdf, other]

    cs.CV cs.LG

    Emergence of Segmentation with Minimalistic White-Box Transformers

    Authors: Yaodong Yu, Tianzhe Chu, Shengbang Tong, Ziyang Wu, Druv Pai, Sam Buchanan, Yi Ma

    Abstract: Transformer-like models for vision tasks have recently proven effective for a wide range of downstream applications such as segmentation and detection. Previous works have shown that segmentation properties emerge in vision transformers (ViTs) trained using self-supervised methods such as DINO, but not in those trained on supervised classification tasks. In this study, we probe whether segmentatio…

    Submitted 30 August, 2023; originally announced August 2023.

    Comments: Code: https://github.com/Ma-Lab-Berkeley/CRATE

  40. arXiv:2307.08556  [pdf, other]

    stat.ML cs.LG eess.IV

    Machine-Learning-based Colorectal Tissue Classification via Acoustic Resolution Photoacoustic Microscopy

    Authors: Shangqing Tong, Peng Ge, Yanan Jiao, Zhaofu Ma, Ziye Li, Longhai Liu, Feng Gao, Xiaohui Du, Fei Gao

    Abstract: Colorectal cancer is a deadly disease that has become increasingly prevalent in recent years. Early detection is crucial for saving lives, but traditional diagnostic methods such as colonoscopy and biopsy have limitations. Colonoscopy cannot provide detailed information within the tissues affected by cancer, while biopsy involves tissue removal, which can be painful and invasive. In order to impro…

    Submitted 17 July, 2023; originally announced July 2023.

  41. arXiv:2306.13843  [pdf, other

    cs.CV eess.IV

    Score-based Generative Models for Photoacoustic Image Reconstruction with Rotation Consistency Constraints

    Authors: Shangqing Tong, Hengrong Lan, Liming Nie, Jianwen Luo, Fei Gao

    Abstract: Photoacoustic tomography (PAT) is a newly emerged imaging modality which enables both high optical contrast and acoustic depth of penetration. Reconstructing images of photoacoustic tomography from a limited amount of sensor data is one of the major challenges in photoacoustic imaging. Previous works based on deep learning were trained in supervised fashion, which directly map the input partia… ▽ More

    Submitted 23 June, 2023; originally announced June 2023.

  42. arXiv:2306.12105  [pdf, other

    cs.LG cs.CL cs.SE

    Mass-Producing Failures of Multimodal Systems with Language Models

    Authors: Shengbang Tong, Erik Jones, Jacob Steinhardt

    Abstract: Deployed multimodal systems can fail in ways that evaluators did not anticipate. In order to find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures -- generalizable, natural-language descriptions of patterns of model failures. To uncover systematic failures, MultiMon scrapes a corpus for examples of erroneous agreement: inputs that… ▽ More

    Submitted 1 March, 2024; v1 submitted 21 June, 2023; originally announced June 2023.

    Comments: Under Review

  43. arXiv:2306.05272  [pdf, other

    cs.CV cs.LG

    Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models

    Authors: Tianzhe Chu, Shengbang Tong, Tianjiao Ding, Xili Dai, Benjamin David Haeffele, René Vidal, Yi Ma

    Abstract: The advent of large pre-trained models has brought about a paradigm shift in both visual representation learning and natural language processing. However, clustering unlabeled images, as a fundamental and classic machine learning problem, still lacks an effective solution, particularly for large-scale datasets. In this paper, we propose a novel image clustering pipeline that leverages the powerful… ▽ More

    Submitted 26 April, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: 23 pages, 14 figures

  44. arXiv:2306.01129  [pdf, other

    cs.LG

    White-Box Transformers via Sparse Rate Reduction

    Authors: Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Benjamin D. Haeffele, Yi Ma

    Abstract: In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep… ▽ More

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: 33 pages, 11 figures

  45. arXiv:2305.15685  [pdf, other

    cs.CL cs.AI

    RewriteLM: An Instruction-Tuned Large Language Model for Text Rewriting

    Authors: Lei Shu, Liangchen Luo, Jayakumar Hoskere, Yun Zhu, Yinxiao Liu, Simon Tong, Jindong Chen, Lei Meng

    Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in creative tasks such as storytelling and E-mail generation. However, as LLMs are primarily trained on final text results rather than intermediate revisions, it might be challenging for them to perform text rewriting tasks. Most studies in the rewriting tasks focus on a particular transformation type within the boundaries of s… ▽ More

    Submitted 19 December, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Journal ref: AAAI 2024

  46. arXiv:2305.14760  [pdf, other

    cs.CL

    Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net Estimation and Optimization

    Authors: Shoujie Tong, Heming Xia, Damai Dai, Runxin Xu, Tianyu Liu, Binghuai Lin, Yunbo Cao, Zhifang Sui

    Abstract: Pretrained language models have achieved remarkable success in natural language understanding. However, fine-tuning pretrained models on limited training data tends to overfit and thus diminish performance. This paper presents Bi-Drop, a fine-tuning strategy that selectively updates model parameters using gradients from various sub-nets dynamically generated by dropout. The sub-net estimation of B… ▽ More

    Submitted 22 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023 Findings. Camera-ready version. Co-first authors with equal contributions

  47. arXiv:2304.03977  [pdf, other

    cs.CV cs.AI

    EMP-SSL: Towards Self-Supervised Learning in One Training Epoch

    Authors: Shengbang Tong, Yubei Chen, Yi Ma, Yann Lecun

    Abstract: Recently, self-supervised learning (SSL) has achieved tremendous success in learning image representation. Despite the empirical success, most self-supervised learning methods are rather "inefficient" learners, typically taking hundreds of training epochs to fully converge. In this work, we show that the key towards efficient self-supervised learning is to increase the number of crops from each im… ▽ More

    Submitted 8 April, 2023; originally announced April 2023.

  48. arXiv:2302.09347  [pdf, other

    cs.CV

    Closed-Loop Transcription via Convolutional Sparse Coding

    Authors: Xili Dai, Ke Chen, Shengbang Tong, Jingyuan Zhang, Xingjian Gao, Mingyang Li, Druv Pai, Yuexiang Zhai, Xiaojun Yuan, Heung-Yeung Shum, Lionel M. Ni, Yi Ma

    Abstract: Autoencoding has achieved great empirical success as a framework for learning generative models for natural images. Autoencoders often use generic deep networks as the encoder or decoder, which are difficult to interpret, and the learned representations lack clear structure. In this work, we make the explicit assumption that the image distribution is generated from a multi-stage sparse deconvoluti… ▽ More

    Submitted 18 February, 2023; originally announced February 2023.

    Comments: 20 pages

  49. arXiv:2302.04265  [pdf, other

    cs.LG cs.CV

    PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

    Authors: Yilun Xu, Ziming Liu, Yonglong Tian, Shangyuan Tong, Max Tegmark, Tommi Jaakkola

    Abstract: We introduce a new family of physics-inspired generative models termed PFGM++ that unifies diffusion models and Poisson Flow Generative Models (PFGM). These models realize generative trajectories for $N$ dimensional data by embedding paths in $N{+}D$ dimensional space while still controlling the progression with a simple scalar norm of the $D$ additional variables. The new models reduce to PFGM wh… ▽ More

    Submitted 10 February, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

    Comments: Code is available at https://github.com/Newbeeer/pfgmpp

  50. arXiv:2302.00670  [pdf, other

    cs.LG cs.CV

    Stable Target Field for Reduced Variance Score Estimation in Diffusion Models

    Authors: Yilun Xu, Shangyuan Tong, Tommi Jaakkola

    Abstract: Diffusion models generate samples by reversing a fixed forward diffusion process. Despite already providing impressive empirical results, these diffusion model algorithms can be further improved by reducing the variance of the training targets in their denoising score-matching objective. We argue that the source of such variance lies in the handling of intermediate noise-variance scales, where mu… ▽ More

    Submitted 17 February, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

    Comments: Accepted by ICLR 2023. Code available at: https://github.com/Newbeeer/stf