From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

Authors: Irving Fang, Juexiao Zhang, Shengbang Tong, Chen Feng

Abstract: One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and do not intend to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier for reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high-level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions, but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities.
We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/

Submitted 11 June, 2025; originally announced June 2025.

Comments: Under review

arXiv:2506.07976 [pdf, ps, other]

Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

Authors: Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar

Abstract: The current paradigm of test-time scaling relies on generating long reasoning traces ("thinking" more) before producing a response. In agent problems that require interaction, this can be done by generating thinking traces before acting in the world. However, this process does not allow agents to acquire new information from the environment or adapt their behavior over time. In this work, we propose to scale test-time interaction, an untapped dimension of test-time scaling that increases the agent's interaction horizon to enable running rich behaviors such as exploration, backtracking, and dynamic re-planning within a single rollout. To demonstrate the promise of this scaling dimension, we study the domain of web agents. We first show that even prompting-based interaction scaling without any training can improve task success on web benchmarks non-trivially. Building on this, we introduce TTI (Test-Time Interaction), a curriculum-based online reinforcement learning (RL) approach that trains agents by adaptively adjusting their rollout lengths. Using a Gemma 3 12B model, TTI produces state-of-the-art open-source, open-data web agents on WebVoyager and WebArena benchmarks. We further show that TTI enables agents to balance exploration and exploitation adaptively. Our results establish interaction scaling as a powerful, complementary axis to scaling per-step compute, offering new avenues for training adaptive agents.

Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

Comments: Fixed typo in Figure 6 and Conclusion

arXiv:2506.05276 [pdf, ps, other]

How to Unlock Time Series Editing? Diffusion-Driven Approach with Multi-Grained Control

Authors: Hao Yu, Chu Xin Cheng, Runlong Yu, Yuyang Ye, Shiwei Tong, Zhaofeng Liu, Defu Lian

Abstract: Recent advances in time series generation have shown promise, yet controlling properties in generated sequences remains challenging. Time Series Editing (TSE) - making precise modifications while preserving temporal coherence - requires both point-level constraints and segment-level controls that current methods struggle to provide. We introduce the CocktailEdit framework to enable simultaneous, flexible control across different types of constraints. This framework combines two key mechanisms: a confidence-weighted anchor control for point-wise constraints and a classifier-based control for managing statistical properties such as sums and averages over segments. Our methods achieve precise local control during the denoising inference stage while maintaining temporal coherence and integrating seamlessly with any conditionally trained diffusion-based time series models. Extensive experiments across diverse datasets and models demonstrate its effectiveness. Our work bridges the gap between pure generative modeling and real-world time series editing needs, offering a flexible solution for human-in-the-loop time series generation and editing. The code and demo are provided for validation.

Submitted 5 June, 2025; originally announced June 2025.

arXiv:2506.01738 [pdf, ps, other]

STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset

Authors: Jinhong Wang, Shuo Tong, Jian liu, Dongqi Tang, Jintai Chen, Haochao Ying, Hongxia Xu, Danny Chen, Jian Wu

Abstract: Visual rating is an essential capability of artificial intelligence (AI) for multi-dimensional quantification of visual content, primarily applied in ordinal regression (OR) tasks such as image quality assessment, facial age estimation, and medical image grading. However, current multi-modal large language models (MLLMs) under-perform in such visual rating ability while also suffering from a lack of relevant datasets and benchmarks. In this work, we collect and present STORM, a data collection and benchmark for Stimulating Trustworthy Ordinal Regression Ability of MLLMs for universal visual rating. STORM encompasses 14 ordinal regression datasets across five common visual rating domains, comprising 655K image-level pairs and the corresponding carefully curated VQAs. Importantly, we also propose a coarse-to-fine processing pipeline that dynamically considers label candidates and provides interpretable thoughts, providing MLLMs with a general and trustworthy ordinal thinking paradigm. This benchmark aims to evaluate the all-in-one and zero-shot performance of MLLMs in scenarios requiring understanding of the essential common ordinal relationships of rating labels. Extensive experiments demonstrate the effectiveness of our framework and shed light on better fine-tuning strategies. The STORM dataset, benchmark, and pre-trained models are available on the following webpage to support further research in this area. Datasets and code are released on the project page: https://storm-bench.github.io/.

Submitted 2 June, 2025; originally announced June 2025.

Comments: Under review at the NeurIPS 2025 D&B track

arXiv:2505.00592 [pdf, other]

Uncertainty-Aware Multi-Expert Knowledge Distillation for Imbalanced Disease Grading

Authors: Shuo Tong, Shangde Gao, Ke Liu, Zihang Huang, Hongxia Xu, Haochao Ying, Jian Wu

Abstract: Automatic disease image grading is a significant application of artificial intelligence for healthcare, enabling faster and more accurate patient assessments. However, domain shifts, which are exacerbated by data imbalance, introduce bias into the model, posing deployment difficulties in clinical applications. To address the problem, we propose a novel Uncertainty-aware Multi-experts Knowledge Distillation (UMKD) framework to transfer knowledge from multiple expert models to a single student model. Specifically, to extract discriminative features, UMKD decouples task-agnostic and task-specific features with shallow and compact feature alignment in the feature space. At the output space, an uncertainty-aware decoupled distillation (UDD) mechanism dynamically adjusts knowledge transfer weights based on expert model uncertainties, ensuring robust and reliable distillation. Additionally, UMKD also tackles the problems of model architecture heterogeneity and distribution discrepancies between source and target domains, which are inadequately tackled by previous KD approaches. Extensive experiments on histology prostate grading (SICAPv2) and fundus image grading (APTOS) demonstrate that UMKD achieves a new state-of-the-art in both source-imbalanced and target-imbalanced scenarios, offering a robust and practical solution for real-world disease image grading.

Submitted 1 May, 2025; originally announced May 2025.

arXiv:2504.15280 [pdf, other]

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

Authors: Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, Yi Ma

Abstract: Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge in Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test a model's geometric correspondence and the capacity to align information consistently across views. Our extensive experiments, benchmarking 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators, reveal a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs particularly underperform in two aspects: (1) cross-view correspondence for partially occluded views and (2) establishing the coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness.
We believe that our All-Angles Bench offers valuable insights and contributes to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at https://danielchyeh.github.io/All-Angles-Bench/.

Submitted 26 April, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

Comments: Project page: https://danielchyeh.github.io/All-Angles-Bench/

arXiv:2504.14891 [pdf, other]

Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey

Authors: Aoran Gan, Hao Yu, Kai Zhang, Qi Liu, Wenyu Yan, Zhenya Huang, Shiwei Tong, Guoping Hu

Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval, enabling accurate, up-to-date, and verifiable text generation across diverse applications. However, evaluating RAG systems presents unique challenges due to their hybrid architecture that combines retrieval and generation components, as well as their dependence on dynamic knowledge sources in the LLM era. In response, this paper provides a comprehensive survey of RAG evaluation methods and frameworks, systematically reviewing traditional and emerging evaluation approaches for system performance, factual accuracy, safety, and computational efficiency in the LLM era. We also compile and categorize the RAG-specific datasets and evaluation frameworks, conducting a meta-analysis of evaluation practices in high-impact RAG research. To the best of our knowledge, this work represents the most comprehensive survey for RAG evaluation, bridging traditional and LLM-driven methods, and serves as a critical resource for advancing RAG development.

Submitted 21 April, 2025; originally announced April 2025.

Comments: 18 pages, 5 figures

arXiv:2504.05778 [pdf, other]

Residual U-Net for accurate and efficient prediction of hemodynamics in two-dimensional asymmetric stenosis

Authors: Xintong Zou, Suiyang Tong, Wenhui Peng, Qiuxiang Huang, Jianchun Wang

Abstract: This study presents residual U-Net (U-ResNet), a deep learning surrogate model for predicting steady hemodynamic fields in two-dimensional asymmetric stenotic channels at Reynolds numbers ranging from 200 to 800. By integrating residual connections with multi-scale feature extraction, U-ResNet achieves exceptional accuracy while significantly reducing computational costs compared to computational fluid dynamics (CFD) approaches. Comprehensive evaluation against U-Net, Fourier Neural Operator (FNO), and U-Net enhanced Fourier Neural Operator (UFNO) demonstrates U-ResNet's superior performance in capturing sharp hemodynamic gradients and complex flow features. For pressure prediction, U-ResNet achieves a normalized mean absolute error (NMAE) of 1.10%. Similarly, the performance of U-ResNet for wall shear stress (NMAE: 0.56%), velocity (NMAE: 1.06%), and vorticity (NMAE: 0.69%) consistently surpasses alternative architectures. Notably, U-ResNet demonstrates robust generalization to interpolated Reynolds numbers without retraining - a capability rarely achieved in existing models. From a computational perspective, U-ResNet delivers a 180-fold acceleration over CFD, reducing simulation time from approximately 30 minutes to 10 seconds per case. The model with non-dimensional formulation ensures scalability across vessel sizes and anatomical locations, enhancing its applicability to diverse clinical scenarios.
These advances position U-ResNet as a promising auxiliary tool to complement CFD simulations for real-time clinical decision support, treatment planning, and medical device optimization. Future work will focus on extending the framework to three-dimensional geometries and integrating it with patient-specific data.

Submitted 27 May, 2025; v1 submitted 8 April, 2025; originally announced April 2025.

arXiv:2504.04801 [pdf, other]

OrderChain: A General Prompting Paradigm to Improve Ordinal Understanding Ability of MLLM

Authors: Jinhong Wang, Shuo Tong, Jian liu, Dongqi Tang, Weiqiang Wang, Wentong Li, Hongxia Xu, Danny Chen, Jintai Chen, Jian Wu

Abstract: Despite the remarkable progress of multimodal large language models (MLLMs), they continue to face challenges in achieving competitive performance on ordinal regression (OR; a.k.a. ordinal classification). To address this issue, this paper presents OrderChain, a novel and general prompting paradigm that improves the ordinal understanding ability of MLLMs by specificity and commonality modeling. Specifically, our OrderChain consists of a set of task-aware prompts to facilitate the specificity modeling of diverse OR tasks and a new range optimization Chain-of-Thought (RO-CoT), which learns a common way of thinking about OR tasks by uniformly decomposing them into multiple small-range optimization subtasks. Further, we propose a category recursive division (CRD) method to generate instruction candidate category prompts to support RO-CoT automatic optimization. Comprehensive experiments show that a Large Language and Vision Assistant (LLaVA) model with our OrderChain improves baseline LLaVA significantly on diverse OR datasets, e.g., from 47.5% to 93.2% accuracy on the Adience dataset for age estimation, and from 30.0% to 85.7% accuracy on the Diabetic Retinopathy dataset. Notably, LLaVA with our OrderChain also remarkably outperforms state-of-the-art methods by 27% on accuracy and 0.24 on MAE on the Adience dataset. To the best of our knowledge, our OrderChain is the first work that augments MLLMs for OR tasks, and the effectiveness is witnessed across a spectrum of OR datasets.

Submitted 7 April, 2025; originally announced April 2025.

arXiv:2504.04431 [pdf]

Observation of Dislocation Non-Hermitian Skin Effect

Authors: Wenquan Wu, Qicheng Zhang, Liangjun Qi, Kun Zhang, Shuaishuai Tong, Chunyin Qiu

Abstract: The non-Hermitian skin effect (NHSE), a striking phenomenon where a large number of states accumulate toward open boundaries, has garnered significant attention in both fundamental physics and emerging applications. Recent theoretical studies unveiled a distinctive dislocation NHSE by disentangling it from the established boundary NHSE, thereby bridging the gap between topological defects and non-Hermitian effects. In this Letter, we report the first experimental observation of the dislocation NHSE, achieved using an ingeniously designed nonreciprocal, torus-like acoustic lattice with two dislocations of opposite Burgers vectors. Our results show that the sound energy density inside the sample dramatically accumulates at one dislocation, while being unusually depleted at the other, a response distinct from all existing NHSE phenomena. This novel non-Hermitian effect not only probes the interplay between nontrivial defects and point-gap topology, but also holds promise for practical applications, such as the design of topological sound vacuum pumps.

Submitted 6 April, 2025; originally announced April 2025.

arXiv:2504.01017 [pdf, other]

Scaling Language-Free Visual Representation Learning

Authors: David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, Saining Xie

Abstract: Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask the question: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?" We study this question by training both visual SSL and CLIP models on the same MetaCLIP data, and leveraging VQA as a diverse testbed for vision encoders. In this controlled setup, visual SSL models scale better than CLIP models in terms of data and model capacity, and visual SSL performance does not saturate even after scaling up to 7B parameters. Consequently, we observe visual SSL methods achieve CLIP-level performance on a wide range of VQA and classic vision benchmarks. These findings demonstrate that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.

Submitted 1 April, 2025; originally announced April 2025.

Comments: Project page at https://davidfan.io/webssl/

arXiv:2503.23205 [pdf, ps, other]

Enhancing Knowledge Graph Completion with Entity Neighborhood and Relation Context

Authors: Jianfang Chen, Kai Zhang, Aoran Gan, Shiwei Tong, Shuanghong Shen, Qi Liu

Abstract: Knowledge Graph Completion (KGC) aims to infer missing information in Knowledge Graphs (KGs) to address their inherent incompleteness. Traditional structure-based KGC methods, while effective, face significant computational demands and scalability challenges due to the need for dense embedding learning and scoring all entities in the KG for each prediction. Recent text-based approaches using language models like T5 and BERT have mitigated these issues by converting KG triples into text for reasoning. However, they often fail to fully utilize contextual information, focusing mainly on the neighborhood of the entity and neglecting the context of the relation. To address this issue, we propose KGC-ERC, a framework that integrates both types of context to enrich the input of generative language models and enhance their reasoning capabilities. Additionally, we introduce a sampling strategy to effectively select relevant context within input token constraints, which optimizes the utilization of contextual information and potentially improves model performance. Experiments on the Wikidata5M, Wiki27K, and FB15K-237-N datasets show that KGC-ERC outperforms or matches state-of-the-art baselines in predictive performance and scalability.

Submitted 29 March, 2025; originally announced March 2025.

arXiv:2503.13551 [pdf, other]

Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Authors: Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Hailei Gong

Abstract: Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate step. In addition, the cost of annotating reasoning processes for reward modeling is high, making large-scale collection of high-quality data challenging. To address this, we propose a novel reward model approach called the Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. HRM excels at assessing multi-step reasoning coherence, especially when flawed steps are later corrected through self-reflection. To further reduce the cost of generating training data, we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC), which merges two consecutive reasoning steps into one within the tree structure. By applying HNC to MCTS-generated reasoning trajectories, we enhance the diversity and robustness of HRM training data while introducing controlled noise with minimal computational overhead. Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM. Furthermore, cross-domain evaluations on the MATH500 and GSM8K datasets demonstrate HRM's strong generalization and robustness across a variety of reasoning tasks.

Submitted 6 May, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

arXiv:2503.13508 [pdf]

It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education

Authors: Shrutika Singh, Anton Alyakin, Daniel Alexander Alber, Jaden Stryker, Ai Phuong S Tong, Karl Sangwon, Nicolas Goff, Mathew de la Paz, Miguel Hernandez-Rovira, Ki Yun Park, Eric Claude Leuthardt, Eric Karl Oermann

Abstract: The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and LLama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10^-5), which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format on performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002) with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight the shortcomings in medical MCQ benchmarks for overestimating the capabilities of LLMs in medicine, and, broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.

Submitted 13 March, 2025; originally announced March 2025.

Comments: 14 pages, 5 figures

arXiv:2503.11531 [pdf, other]

Potential of large language model-powered nudges for promoting daily water and energy conservation

Authors: Zonghan Li, Song Tong, Yi Liu, Kaiping Peng, Chunyan Wang

Abstract: The increasing amount of pressure related to water and energy shortages has increased the urgency of cultivating individual conservation behaviors. While the concept of nudging, i.e., providing usage-based feedback, has shown promise in encouraging conservation behaviors, its efficacy is often constrained by the lack of targeted and actionable content. This study investigates the impact of the use of large language models (LLMs) to provide tailored conservation suggestions for conservation intentions and their rationale. Through a survey experiment with 1,515 university participants, we compare three virtual nudging scenarios: no nudging, traditional nudging with usage statistics, and LLM-powered nudging with usage statistics and personalized conservation suggestions. The results of statistical analyses and causal forest modeling reveal that nudging led to an increase in conservation intentions among 86.9%-98.0% of the participants. LLM-powered nudging achieved a maximum increase of 18.0% in conservation intentions, surpassing traditional nudging by 88.6%. Furthermore, structural equation modeling results reveal that exposure to LLM-powered nudges enhances self-efficacy and outcome expectations while diminishing dependence on social norms, thereby increasing intrinsic motivation to conserve. These findings highlight the transformative potential of LLMs in promoting individual water and energy conservation, representing a new frontier in the design of sustainable behavioral interventions and resource management.

Submitted 14 March, 2025; originally announced March 2025.

arXiv:2503.01152 [pdf, other]

STGAN: Spatial-temporal Graph Autoregression Network for Pavement Distress Deterioration Prediction

Authors: Shilin Tong, Difei Wu, Xiaona Liu, Le Zheng, Yuchuan Du, Difan Zou

Abstract: Pavement distress significantly compromises road integrity and poses risks to drivers. Accurate prediction of pavement distress deterioration is essential for effective road management, cost reduction in maintenance, and improvement of traffic safety. However, real-world data on pavement distress is usually collected irregularly, resulting in uneven, asynchronous, and sparse spatial-temporal datasets. This hinders the application of existing spatial-temporal models, such as DCRNN, since they are only applicable to regularly and synchronously collected data. To overcome these challenges, we propose the Spatial-Temporal Graph Autoregression Network (STGAN), a novel graph neural network model designed for accurately predicting irregular pavement distress deterioration using complex spatial-temporal data. Specifically, STGAN integrates the temporal domain into the spatial domain, creating a larger graph where nodes are represented by spatial-temporal tuples and edges are formed based on a similarity-based connection mechanism. Furthermore, based on the constructed spatiotemporal graph, we formulate pavement distress deterioration prediction as a graph autoregression task, i.e., the graph size increases incrementally and the prediction is performed sequentially. This is accomplished by a novel spatial-temporal attention mechanism deployed by STGAN.
Utilizing the ConTrack dataset, which contains pavement distress records collected from different locations in Shanghai, we demonstrate the superior performance of STGAN in capturing spatial-temporal correlations and addressing the aforementioned challenges. Experimental results further show that STGAN outperforms baseline models, and ablation studies confirm the effectiveness of its novel modules. Our findings contribute to promoting proactive road maintenance decision-making and ultimately enhancing road safety and resilience.

Submitted 2 March, 2025; originally announced March 2025.

Comments: 16 pages, 16 figures, 4 tables, accepted by IEEE Transactions on Intelligent Transportation Systems (TITS)

arXiv:2502.04452 [pdf, other]

Compact protoplanetary discs can be produced by dead zones

Authors: Simin Tong, Richard Alexander

Abstract: Radially compact protoplanetary discs (<=50 au) are ubiquitous in nearby star-forming regions. Multiple mechanisms have been invoked to interpret various compact discs. In this paper, we propose that fragmentation of fragile dust grains in moderate turbulence, as expected beyond the dead zone, provides an effective alternative mechanism to form compact discs which are consistent with current observations. We run 1-D dust transport and collision models with DustPy and generate synthetic observations, and find that discs formed by this mechanism have sizes determined by the extent of their dead zones. Accounting for dust porosity, and considering less fragile dust, do not change disc sizes significantly. The smooth dust morphology can be altered only when pressure bumps are present in the dead zone. However, when present at small radii (<=10 au), pressure bumps cannot effectively trap dust. Dust in these bumps fragments and replenishes the inner discs, effectively hiding dust traps in the optically thick inner disc from observations. We note a striking resemblance in the radial intensity profile between our synthetic observations and some recent high-resolution observations of compact discs. We discuss how such observations can inform our understanding of the underlying disc physics.

Submitted 6 February, 2025; originally announced February 2025.

Comments: 17 pages, 14+1 figures. Accepted for publication in MNRAS

arXiv:2501.17161 [pdf, other]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Authors: Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma

Abstract: Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.

Submitted 26 May, 2025; v1 submitted 28 January, 2025; originally announced January 2025.

Comments: Website at https://tianzhechu.com/SFTvsRL

arXiv:2501.09732 [pdf, other]

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Authors: Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, Saining Xie

Abstract: Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and with the complicated nature of images, combinations of the components in the framework can be specifically chosen to conform with different application scenarios.

Submitted 16 January, 2025; originally announced January 2025.

arXiv:2501.05075 [pdf, other]

A Text-Based Knowledge-Embedded Soft Sensing Modeling Approach for General Industrial Process Tasks Based on Large Language Model

Authors: Shuo Tong, Han Liu, Runyuan Guo, Xueqiong Tian, Wenqing Wang, Ding Liu, Youmin Zhang

Abstract: Data-driven soft sensors (DDSS) have become mainstream methods for predicting key performance indicators in process industries. However, DDSS development requires complex and costly customized designs tailored to various tasks during the modeling process. Moreover, DDSS are constrained to a single structured data modality, limiting their ability to incorporate additional contextual knowledge. Furthermore, DDSSs' limited representation learning leads to weak predictive performance with scarce data. To address these challenges, we propose a general framework named LLM-TKESS (large language model for text-based knowledge-embedded soft sensing), harnessing the powerful general problem-solving capabilities, cross-modal knowledge transfer abilities, and few-shot capabilities of LLM for enhanced soft sensing modeling. Specifically, an auxiliary variable series encoder (AVS Encoder) is proposed to unleash LLM's potential for capturing temporal relationships within series and spatial semantic relationships among auxiliary variables. Then, we propose a two-stage fine-tuning alignment strategy: in the first stage, employing parameter-efficient fine-tuning through autoregressive training adjusts LLM to rapidly accommodate process variable data, resulting in a soft sensing foundation model (SSFM). Subsequently, by training adapters, we adapt the SSFM to various downstream tasks without modifying its architecture.
Then, we propose two text-based knowledge-embedded soft sensors, integrating new natural language modalities to overcome the limitations of pure structured data models. Furthermore, benefiting from LLM's pre-existing world knowledge, our model demonstrates outstanding predictive capabilities in small sample conditions. Using the thermal deformation of air preheater rotor as a case study, we validate through extensive experiments that LLM-TKESS exhibits outstanding performance.

Submitted 9 January, 2025; originally announced January 2025.

arXiv:2501.03295 [pdf]

A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation Based on Large Language Models Enhanced by Domain Knowledge Retrieval

Authors: Shuo Tong, Han Liu, Runyuan Guo, Wenqing Wang, Xueqiong Tian, Lingyun Wei, Lin Zhang, Huayong Wu, Ding Liu, Youmin Zhang

Abstract: Data-driven soft sensors are crucial in predicting key performance indicators in industrial systems. However, current methods predominantly rely on the supervised learning paradigms of parameter updating, which inherently face challenges such as high development costs, poor robustness, training instability, and lack of interpretability. Recently, large language models (LLMs) have demonstrated significant potential across various domains, notably through In-Context Learning (ICL), which enables high-performance task execution with minimal input-label demonstrations and no prior training. This paper aims to replace supervised learning with the emerging ICL paradigm for soft sensor modeling to address existing challenges and explore new avenues for advancement. To achieve this, we propose a novel framework called the Few-shot Uncertainty-aware and self-Explaining Soft Sensor (LLM-FUESS), which includes the Zero-shot Auxiliary Variable Selector (LLM-ZAVS) and the Uncertainty-aware Few-shot Soft Sensor (LLM-UFSS). The LLM-ZAVS retrieves from the Industrial Knowledge Vector Storage to enhance LLMs' domain-specific knowledge, enabling zero-shot auxiliary variable selection. In the LLM-UFSS, we utilize text-based context demonstrations of structured data to prompt LLMs to execute ICL for predicting and propose a context sample retrieval augmentation strategy to improve performance. Additionally, we explored LLMs' AIGC and probabilistic characteristics to propose self-explanation and uncertainty quantification methods for constructing a trustworthy soft sensor.
Extensive experiments demonstrate that our method achieves state-of-the-art predictive performance, strong robustness, and flexibility, and effectively mitigates the training instability found in traditional methods. To the best of our knowledge, this is the first work to establish a soft sensor utilizing LLMs.

Submitted 7 January, 2025; v1 submitted 6 January, 2025; originally announced January 2025.

arXiv:2412.14164 [pdf, other]

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Authors: Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu

Abstract: In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong "prior" vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.

Submitted 18 December, 2024; originally announced December 2024.

Comments: Project page at tsb0601.github.io/metamorph

arXiv:2412.01711 [pdf, other]

Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

Authors: Schrasing Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal

Abstract: Although large language models (LLMs) have demonstrated their effectiveness in a wide range of applications, they have also been observed to perpetuate unwanted biases present in the training data, potentially leading to harm for marginalized communities. In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal that will be added to the LLM output at decoding-time. This approach combines resource efficiency with interpretability and can be optimized for mitigating specific types of bias, depending on the target use case. Experiments on mitigating gender, race, and religion biases show a reduction in bias on several local and global bias metrics while preserving language model performance.

Submitted 2 December, 2024; originally announced December 2024.

Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Safe Generative AI Workshop

arXiv:2411.16039 [pdf]

Label-Free Intraoperative Mean-Transition-Time Image Generation Using Statistical Gating and Deep Learning

Authors: Yan Shi, Denghui Zhao, Jingyi Yu, Wei Ni, Pengcheng Li, Yun Gu, Peng Miao, Shanbao Tong

Abstract: It is of paramount importance to visualize blood dynamics intraoperatively, as this enables the accurate diagnosis of intraoperative conditions and facilitates informed surgical decision-making. Indocyanine green (ICG) fluorescence imaging represents the gold standard for the assessment of blood flow and the identification of vascular structures. However, it has several disadvantages, including time-consuming data acquisition, mandatory waiting periods, potential allergic reactions, and complex operations. Laser speckle contrast imaging (LSCI) provides an alternative for label-free assessment of blood flow; however, it lacks the necessary information for distinguishing arteries from veins and determining blood flow direction. Such information may be inferred from a Mean Transition Time (MTT) image derived from fluorescence imaging. In order to address these challenges, we propose the implementation of a Mixed Attention Dense UNet (MA-DenseUNet), which will be used to generate synthetic MTT images based on statistically gated deep tissue contrast and white light images. The proposed method provides clear delineation of vasculature, differentiation of arteries and veins, decoding of blood flow direction, and a reduction in imaging time by a minimum of 97.69%. This study demonstrates the potential of deep learning to optimize intraoperative optical imaging techniques, thereby providing faster and more efficient label-free surgical guidance.

Submitted 24 November, 2024; originally announced November 2024.

arXiv:2410.19560 [pdf, other]

Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

Authors: Shentong Mo, Shengbang Tong

Abstract: In recent advancements in unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations have been identified: the inefficacy of Exponential Moving Average (EMA) from I-JEPA in preventing entire collapse and the inadequacy of I-JEPA prediction in accurately learning the mean of patch representations. Addressing these challenges, this study introduces a novel framework, namely C-JEPA (Contrastive-JEPA), which integrates the Image-based Joint-Embedding Predictive Architecture with the Variance-Invariance-Covariance Regularization (VICReg) strategy. This integration is designed to effectively learn the variance/covariance for preventing entire collapse and ensuring invariance in the mean of augmented views, thereby overcoming the identified limitations. Through empirical and theoretical evaluations, our work demonstrates that C-JEPA significantly enhances the stability and quality of visual representation learning. When pre-trained on the ImageNet-1K dataset, C-JEPA exhibits rapid and improved convergence in both linear probing and fine-tuning performance metrics.

Submitted 25 October, 2024; originally announced October 2024.

arXiv:2410.13669 [pdf]

Theta and/or alpha? Neural oscillational substrates for dynamic inter-brain synchrony during mother-child cooperation

Authors: Jiayang Xu, Yamin Li, Ruxin Su, Saishuang Wu, Chengcheng Wu, Haiwa Wang, Qi Zhu, Yue Fang, Fan Jiang, Shanbao Tong, Yunting Zhang, Xiaoli Guo

Abstract: Mother-child interaction is a highly dynamic process neurally characterized by inter-brain synchrony (IBS) at θ and/or α rhythms. However, their establishment, dynamic changes, and roles in mother-child interactions remain unknown. Through dynamic analysis of dual-EEG from 40 mother-child dyads during turn-taking cooperation, we uncover that θ-IBS and α-IBS alternated with interactive behaviors, with EEG frequency-shift as a prerequisite for IBS transitions. When mothers attempt to track their children's attention and/or predict their intentions, they will adjust their EEG frequencies to align with their children's θ oscillations, leading to a higher occurrence of the θ-IBS state. Conversely, the α-IBS state, accompanied by the EEG frequency-shift to the α range, is more prominent during mother-led interactions. Further exploratory analysis reveals greater presence and stability of the θ-IBS state during cooperative than non-cooperative conditions, particularly in dyads with stronger emotional attachments and more frequent interactions in their daily lives. Our findings shed light on the neural oscillational substrates underlying the IBS dynamics during mother-child interactions.

Submitted 30 October, 2024; v1 submitted 17 October, 2024; originally announced October 2024.

Comments: 27 pages, 6 figures

arXiv:2410.00269 [pdf, other]

Learning to Reconstruct Quirky Tracks

Authors: Qiyu Sha, Daniel Murnane, Max Fieg, Shelley Tong, Mark Zakharyan, Yaquan Fang, Daniel Whiteson

Abstract: Analysis of data from particle physics experiments traditionally sacrifices some sensitivity to new particles for the sake of practical computability, effectively ignoring some potentially striking signatures. However, recent advances in ML-based tracking allow for new inroads into previously inaccessible territory, such as reconstruction of tracks which do not follow helical trajectories. This paper presents a demonstration of the capacity of ML-based tracking to reconstruct the oscillating trajectories of quirks. The technique used is not specific to quirks, and opens the door to a program of searching for many kinds of non-standard tracks.

Submitted 30 September, 2024; originally announced October 2024.

arXiv:2409.06184 [pdf, other]

A Policy Iteration Method for Inverse Mean Field Games

Authors: Kui Ren, Nathan Soedjak, Shanyin Tong

Abstract: We propose a policy iteration method to solve an inverse problem for a mean-field game (MFG) model, specifically to reconstruct the obstacle function in the game from the partial observation data of value functions, which represent the optimal costs for agents. The proposed approach decouples this complex inverse problem, which is an optimization problem constrained by a coupled nonlinear forward and backward PDE system in the MFG, into several iterations of solving linear PDEs and linear inverse problems. This method can also be viewed as a fixed-point iteration that simultaneously solves the MFG system and inversion. We prove its linear rate of convergence. In addition, numerical examples in 1D and 2D, along with performance comparisons to a direct least-squares method, demonstrate the superior efficiency and accuracy of the proposed method for solving inverse MFGs.

Submitted 15 April, 2025; v1 submitted 9 September, 2024; originally announced September 2024.

MSC Class: 35Q89; 35R30; 49L12; 49M41; 49N45; 49N80; 65K10; 91A16

arXiv:2409.02813 [pdf, other]

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Authors: Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig

Abstract: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.

Submitted 22 May, 2025; v1 submitted 4 September, 2024; originally announced September 2024.

Comments: ACL 2025 Main

arXiv:2408.03496 [pdf, other]

A three-stage method for reconstructing multiple coefficients in coupled photoacoustic and diffuse optical imaging

Authors: Yinxi Pan, Kui Ren, Shanyin Tong

Abstract: This paper studies inverse problems in quantitative photoacoustic tomography with additional optical current data supplemented from diffuse optical tomography. We propose a three-stage image reconstruction method for the simultaneous recovery of the absorption, diffusion, and Grüneisen coefficients. We demonstrate, through numerical simulations, that: (i) when the Grüneisen coefficient is known, the addition of the optical measurements allows a more accurate reconstruction of the scattering and absorption coefficients; and (ii) when the Grüneisen coefficient is not known, the addition of optical current measurements allows us to reconstruct uniquely the Grüneisen, the scattering and absorption coefficients. Numerical simulations based on synthetic data are presented to demonstrate the effectiveness of the proposed idea.

Submitted 23 January, 2025; v1 submitted 6 August, 2024; originally announced August 2024.

MSC Class: 35J47; 35R30; 49M15; 65M32; 78A46; 78A60; 78A70; 80A23; 92C55; 94A08

arXiv:2407.12209 [pdf, other]

doi 10.1093/mnras/stae1748

A question of personalities: evolution of viscous and wind-driven protoplanetary discs in the presence of dead zones

Authors: Simin Tong, Richard Alexander, Giovanni Rosotti

Abstract: Whether the angular momentum of protoplanetary discs is redistributed by viscosity or extracted by magnetised winds is a long-standing question. Demographic indicators, such as gas disc sizes and stellar accretion rates, have been proposed as ways of distinguishing between these two mechanisms. In this paper, we implement one-dimensional gas simulations to study the evolution of "hybrid" protoplanetary discs simultaneously driven by viscosity and magnetised winds, with dead zones present. We explore how the variations of disc properties, including initial disc sizes, dead zone sizes and angular momentum transport efficiency, affect stellar accretion rates, disc surface density profiles, disc sizes, disc lifetimes, and cumulative mass loss by different processes. Our models show that the expansion of the gas disc size can be sustained when the majority of angular momentum is removed by the magnetised wind for individual protoplanetary discs. However, when we can only observe discs via demographic screenshots, the variation of disc sizes with time is possibly diminished by the disc "personalities", by which we mean the variations of initial disc properties among different discs. Our "hybrid" models re-assess association of the two demographic indicators with mechanisms responsible for angular momentum transport and suggest additional diagnostics are required to assist the differentiation.

Submitted 16 July, 2024; originally announced July 2024.

Comments: 23 pages, 17 figures. Accepted for publication in MNRAS

arXiv:2406.16860 [pdf, other]

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Authors: Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, Saining Xie

Abstract: We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combinations thereof -- based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, address the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.

Submitted 4 December, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

Comments: NeurIPS 2024 (Oral). Website at https://cambrian-mllm.github.io

arXiv:2406.01276 [pdf, other]

EduNLP: Towards a Unified and Modularized Library for Educational Resources

Authors: Zhenya Huang, Yuting Ning, Longhu Qin, Shiwei Tong, Shangzi Xue, Tong Xiao, Xin Lin, Jiayu Liu, Qi Liu, Enhong Chen, Shijing Wang

Abstract: Educational resource understanding is vital to online learning platforms, which have demonstrated growing applications recently. However, researchers and developers always struggle with using existing general natural language toolkits or domain-specific models. The issue raises a need to develop an effective and easy-to-use one that benefits AI education-related research and applications. To bridge this gap, we present a unified, modularized, and extensive library, EduNLP, focusing on educational resource understanding. In the library, we decouple the whole workflow to four key modules with consistent interfaces including data configuration, processing, model implementation, and model evaluation. We also provide a configurable pipeline to unify the data usage and model usage in standard ways, where users can customize their own needs. For the current version, we primarily provide 10 typical models from four categories, and 5 common downstream-evaluation tasks in the education domain on 8 subjects for users' usage. The project is released at: https://github.com/bigdata-ustc/EduNLP.

Submitted 4 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.10292 [pdf, other]

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Authors: Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine

Abstract: Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.

Submitted 7 October, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.07437 [pdf, other]

doi 10.1007/978-981-96-1024-2_8

Evaluation of Retrieval-Augmented Generation: A Survey

Authors: Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, Zhaofeng Liu

Abstract: Retrieval-Augmented Generation (RAG) has recently gained traction in natural language processing. Numerous studies and real-world applications are leveraging its ability to enhance generative models through external information retrieval. Evaluating these RAG systems, however, poses unique challenges due to their hybrid structure and reliance on dynamic knowledge sources. To better understand these challenges, we conduct A Unified Evaluation Process of RAG (Auepora) and aim to provide a comprehensive overview of the evaluation and benchmarks of RAG systems. Specifically, we examine and compare several quantifiable metrics of the Retrieval and Generation components, such as relevance, accuracy, and faithfulness, within the current RAG benchmarks, encompassing the possible output and ground truth pairs. We then analyze the various datasets and metrics, discuss the limitations of current benchmarks, and suggest potential directions to advance the field of RAG benchmarks.

Submitted 3 July, 2024; v1 submitted 12 May, 2024; originally announced May 2024.

arXiv:2404.13918 [pdf]

Emerging Advancements in 6G NTN Radio Access Technologies: An Overview

Authors: Husnain Shahid, Carla Amatetti, Riccardo Campana, Sorya Tong, Dorin Panaitopol, Alessandro Vanelli Coralli, Abdelhamed Mohamed, Chao Zhang, Ebraam Khalifa, Eduardo Medeiros, Estefania Recayte, Fatemeh Ghasemifard, Ji Lianghai, Juan Bucheli, Karthik Anantha Swamy, Marius Caus, Mehmet Gurelli, Miguel A. Vazquez, Musbah Shaat, Nathan Borios, Per-Erik Eriksson, Sebastian Euler, Zheng Li, Xiaotian Fu

Abstract: The efforts on the development, standardization and improvements to communication systems towards 5G Advanced and 6G are on track to provide benefits such as an unprecedented level of connectivity and performance, enabling a diverse range of vertical services. The full integration of non-terrestrial components into 6G plays a pivotal role in realizing this paradigm shift towards ubiquitous communication and global coverage. However, this integration into 6G brings forth a set of its own challenges, particularly in Radio Access Technologies (RATs). To this end, this paper comprehensively discusses those challenges at different levels of RATs and proposes the corresponding potential emerging advancements in the realm of 6G NTN. In particular, the focus is on advancing the prospective aspects of Radio Resource Management (RRM), spectral coexistence in terrestrial and non-terrestrial components and flexible waveform design solutions to combat the impediments. This discussion with a specific focus on emerging advancements in 6G NTN RATs is critical for shaping the next generation networks and potentially relevant in contributing the part in standardization in forthcoming releases.

Submitted 22 April, 2024; originally announced April 2024.

Comments: accepted in 2024 EuCNC and 6G Summit, Antwerp, Belgium, 3-6 June 2024

arXiv:2404.09567 [pdf, other]

A competitive game optimization algorithm for Unmanned Aerial Vehicle path planning

Authors: Tai-shan Lou, Guang-sheng Guan, Zhe-peng Yue, Yu Wang, Ren-long Qi, Shi-hao Tong

Abstract: To solve the Unmanned Aerial Vehicle (UAV) path planning problem, a meta-heuristic optimization algorithm called competitive game optimizer (CGO) is proposed. In the CGO model, three phases of exploration and exploitation, and candidate replacement, are established, corresponding to the player's search for supplies and combat, and the movement toward a safe zone. In the algorithm exploration phase, Levy flight is introduced to improve the global convergence of the algorithm. The encounter probability which adaptively changes with the number of iterations is also introduced in the CGO. The balance between exploration and exploitation of solution space of optimization problem is realized, and each step is described and modeled mathematically. The performance of the CGO was evaluated on a set of 41 test functions taken from CEC2017 and CEC2022. It was then compared with eight widely recognized meta-heuristic optimization algorithms. The simulation results demonstrate that the proposed algorithm successfully achieves a balanced trade-off between exploration and exploitation, showcasing remarkable advantages when compared to seven classical algorithms. In addition, in order to further verify the effectiveness of the CGO, the CGO is applied to 8 practical engineering design problems and UAV path planning, and the results show that the CGO has strong performance in dealing with these practical optimization problems, and has a good application prospect.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.05575 [pdf]

Prediction of topotactic transition from black to blue phosphorus induced by surface Br adsorption

Authors: Hao Tian, Wenjun Xie, Maohai Xie, Chuanhui Zhu, Hu Xu, Shuk-Yin Tong

Abstract: Based on first-principles calculations, we propose a potential access to the yet unrealized freestanding blue phosphorus (blueP) through transformation of black phosphorus (blackP) induced by surface bromine (Br) adsorption. Formation of the Br-P bonds disrupts the original $sp^3$ configurations in blackP, generates unpaired $p_z$ electrons and induces a structural transformation that results in blueP formation by re-pairing the $p_z$ orbitals. Ab initio molecular dynamics simulations confirm that randomly adsorbed Br adatoms on bilayer blackP spontaneously diffuse into specific patterns to render the emergence of the blueP phase. The expected obtainment Br-passivated blueP nanoribbons exhibit tunable band gaps in a wide range and high carrier mobilities of the order of 1000 cm$^2$ V$^{-1}$ s$^{-1}$. This study provides an opportunity to fabricate blueP through the conversion from blackP by tuning its surface chemistry.

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2403.10953 [pdf, other]

Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription

Authors: Hongxiang Zhao, Xili Dai, Jianan Wang, Shengbang Tong, Jingyuan Zhang, Weida Wang, Lei Zhang, Yi Ma

Abstract: Large image diffusion models have demonstrated zero-shot capability in novel view synthesis (NVS). However, existing diffusion-based NVS methods struggle to generate novel views that are accurately consistent with the corresponding ground truth poses and appearances, even on the training set. This consequently limits the performance of downstream tasks, such as image-to-multiview generation and 3D reconstruction. We realize that such inconsistency is largely due to the fact that it is difficult to enforce accurate pose and appearance alignment directly in the diffusion training, as mostly done by existing methods such as Zero123. To remedy this problem, we propose Ctrl123, a closed-loop transcription-based NVS diffusion method that enforces alignment between the generated view and ground truth in a pose-sensitive feature space. Our extensive experiments demonstrate the effectiveness of Ctrl123 on the tasks of NVS and 3D reconstruction, achieving significant improvements in both multiview-consistency and pose-consistency over existing methods.

Submitted 21 June, 2024; v1 submitted 16 March, 2024; originally announced March 2024.

arXiv:2402.14424 [pdf]

doi 10.1057/s41599-024-03407-5

Automating psychological hypothesis generation with AI: when large language models meet causal graph

Authors: Song Tong, Kai Mao, Zhen Huang, Yukun Zhao, Kaiping Peng

Abstract: Leveraging the synergy between causal knowledge graphs and a large language model (LLM), our study introduces a groundbreaking approach for computational hypothesis generation in psychology. We analyzed 43,312 psychology articles using a LLM to extract causal relation pairs. This analysis produced a specialized causal graph for psychology. Applying link prediction algorithms, we generated 130 potential psychological hypotheses focusing on 'well-being', then compared them against research ideas conceived by doctoral scholars and those produced solely by the LLM. Interestingly, our combined approach of a LLM and causal graphs mirrored the expert-level insights in terms of novelty, clearly surpassing the LLM-only hypotheses (t(59) = 3.34, p=0.007 and t(59) = 4.32, p<0.001, respectively). This alignment was further corroborated using deep semantic analysis. Our results show that combining LLM with machine learning techniques such as causal knowledge graphs can revolutionize automated discovery in psychology, extracting novel insights from the extensive literature. This work stands at the crossroads of psychology and artificial intelligence, championing a new enriched paradigm for data-driven hypothesis generation in psychological research.

Submitted 15 July, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Journal ref: Humanities and Social Sciences Communications, (2024) 11:896

arXiv:2402.02065 [pdf, other]

Training Implicit Networks for Image Deblurring using Jacobian-Free Backpropagation

Authors: Linghai Liu, Shuaicheng Tong, Lisa Zhao

Abstract: Recent efforts in applying implicit networks to solve inverse problems in imaging have achieved competitive or even superior results when compared to feedforward networks. These implicit networks only require constant memory during backpropagation, regardless of the number of layers. However, they are not necessarily easy to train. Gradient calculations are computationally expensive because they require backpropagating through a fixed point. In particular, this process requires solving a large linear system whose size is determined by the number of features in the fixed point iteration. This paper explores a recently proposed method, Jacobian-free Backpropagation (JFB), a backpropagation scheme that circumvents such calculation, in the context of image deblurring problems. Our results show that JFB is comparable against fine-tuned optimization schemes, state-of-the-art (SOTA) feedforward networks, and existing implicit networks at a reduced computational cost.

Submitted 3 February, 2024; originally announced February 2024.

arXiv:2401.06209 [pdf, other]

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Authors: Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie

Abstract: Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs" - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and found a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems.

Submitted 25 April, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: Project page: https://tsb0601.github.io/mmvp_blog/

arXiv:2401.01519 [pdf]

Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review

Authors: Luoma Ke, Song Tong, Peng Cheng, Kaiping Peng

Abstract: This paper explores the frontiers of large language models (LLMs) in psychology applications. Psychology has undergone several theoretical changes, and the current use of Artificial Intelligence (AI) and Machine Learning, particularly LLMs, promises to open up new research directions. We provide a detailed exploration of how LLMs like ChatGPT are transforming psychological research. It discusses the impact of LLMs across various branches of psychology, including cognitive and behavioral, clinical and counseling, educational and developmental, and social and cultural psychology, highlighting their potential to simulate aspects of human cognition and behavior. The paper delves into the capabilities of these models to emulate human-like text generation, offering innovative tools for literature review, hypothesis generation, experimental design, experimental subjects, data analysis, academic writing, and peer review in psychology. While LLMs are essential in advancing research methodologies in psychology, the paper also cautions about their technical and ethical challenges. There are issues like data privacy, the ethical implications of using LLMs in psychological research, and the need for a deeper understanding of these models' limitations. Researchers should responsibly use LLMs in psychological studies, adhering to ethical standards and considering the potential consequences of deploying these technologies in sensitive areas. Overall, the article provides a comprehensive overview of the current state of LLMs in psychology, exploring potential benefits and challenges. It serves as a call to action for researchers to leverage LLMs' advantages responsibly while addressing associated risks.

Submitted 20 April, 2025; v1 submitted 2 January, 2024; originally announced January 2024.

arXiv:2312.11490 [pdf]

Tracking Intrinsic Non-Hermitian Skin Effect in Lossy Lattices

Authors: Liwei Xiong, Qicheng Zhang, Xiling Feng, Yufei Leng, Min Pi, Shuaishuai Tong, Chunyin Qiu

Abstract: Non-Hermitian skin effect (NHSE), characterized by a majority of eigenstates localized at open boundaries, is one of the most iconic phenomena in non-Hermitian lattices. Despite notable experimental studies implemented, most of them witness only certain signs of the NHSE rather than the intrinsic exponential localization inherent in eigenstates, owing to the ubiquitous and inevitable background loss. Even worse, the experimental observation of the NHSE would be completely obscured in highly lossy cases. Here, we theoretically propose a dual test approach to eliminate the destructive loss effect and track the intrinsic NHSE that is essentially irrelevant to background loss. Experimentally, the effectiveness of this approach is precisely validated by one- and two-dimensional non-Hermitian acoustic lattices. Our study sheds new light on the previously untapped intrinsic aspect of the NHSE, which is of particular significance in non-Hermitian topological physics.

Submitted 1 December, 2023; originally announced December 2023.

arXiv:2311.13110 [pdf, other]

White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

Authors: Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D. Haeffele, Yi Ma

Abstract: In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .

Submitted 6 September, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

Comments: Accepted at Journal of Machine Learning Research. This paper integrates the works arXiv:2306.01129 and arXiv:2308.16271 into a complete story. In this paper, we improve the writing and organization, and also add conceptual, empirical, and theoretical improvements over the previous work. V2: small typo fixes/formatting improvements. V3: improvements from journal revisions. V4: fix figures

arXiv:2311.00121 [pdf, other]

New Physics in Single Resonant Top Quarks

Authors: Shelley Tong, James Corcoran, Max Fieg, Michael Fenton, Daniel Whiteson

Abstract: Searches for new physics in the top quark sector are of great theoretical interest, yet some powerful avenues for discovery remain unexplored. We characterize the expected statistical power of the LHC dataset to constrain the single production of heavy top partners $T$ decaying to a top quark and a photon or a top quark and a gluon. We describe an effective interaction which could generate such production, though the limits apply to a range of theoretical models. We find sensitivity to cross sections in the $10^{2}-10^{5}$ fb range, for $T$ masses between 300 and 1000 GeV, depending on decay mode.

Submitted 31 October, 2023; originally announced November 2023.

arXiv:2310.16906 [pdf, other]

doi 10.1615/Int.J.UncertaintyQuantification.2024051416

Sensitivity Analysis of the Information Gain in Infinite-Dimensional Bayesian Linear Inverse Problems

Authors: Abhijit Chowdhary, Shanyin Tong, Georg Stadler, Alen Alexanderian

Abstract: We study the sensitivity of infinite-dimensional Bayesian linear inverse problems governed by partial differential equations (PDEs) with respect to modeling uncertainties. In particular, we consider derivative-based sensitivity analysis of the information gain, as measured by the Kullback-Leibler divergence from the posterior to the prior distribution. To facilitate this, we develop a fast and accurate method for computing derivatives of the information gain with respect to auxiliary model parameters. Our approach combines low-rank approximations, adjoint-based eigenvalue sensitivity analysis, and post-optimal sensitivity analysis. The proposed approach also paves way for global sensitivity analysis by computing derivative-based global sensitivity measures. We illustrate different aspects of the proposed approach using an inverse problem governed by a scalar linear elliptic PDE, and an inverse problem governed by the three-dimensional equations of linear elasticity, which is motivated by the inversion of the fault-slip field after an earthquake.

Submitted 16 May, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

Comments: 20 pages, 7 figures

MSC Class: 65C60; 90C31; 62F15; 35R30; 65F55

arXiv:2309.16681 [pdf, other]

Alternate Learning based Sparse Semantic Communications for Visual Transmission

Authors: Siyu Tong, Xiaoxue Yu, Rongpeng Li, Kun Lu, Zhifeng Zhao, Honggang Zhang

Abstract: Semantic communication (SemCom) demonstrates strong superiority over conventional bit-level accurate transmission by attempting to recover only the essential semantic information of data. In this paper, in order to tackle the non-differentiability of channels, we propose an alternate learning based SemCom system for visual transmission, named SparseSBC. Specifically, SparseSBC leverages two separate Deep Neural Network (DNN)-based models at the transmitter and receiver, respectively, and learns the encoding and decoding in an alternate manner, rather than through the joint optimization used in existing literature, so as to address the non-differentiability of the channel. In particular, a "self-critic" training scheme is leveraged for stable training. Moreover, the DNN-based transmitter generates a sparse set of bits in deduced "semantic bases", by further incorporating a binary quantization module on the basis of minimal detrimental effect to the semantic accuracy. Extensive simulation results validate that SparseSBC shows efficient and effective transmission performance under various channel conditions, and outperforms typical SemCom solutions.

Submitted 30 July, 2023; originally announced September 2023.

arXiv:2309.10313 [pdf, other]

Investigating the Catastrophic Forgetting in Multimodal Large Language Models

Authors: Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma

Abstract: Following the success of GPT4, there has been a surge in interest in multimodal large language model (MLLM) research. This line of research focuses on developing general-purpose LLMs through fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon where the fine-tuned model fails to retain similar performance compared to the pre-trained model, still remains an inherent problem in MLLMs. In this paper, we introduce EMT: Evaluating MulTimodality for evaluating the catastrophic forgetting in MLLMs, by treating each MLLM as an image classifier. We first apply EMT to evaluate several open-source fine-tuned MLLMs and we discover that almost all evaluated MLLMs fail to retain the same performance levels as their vision encoders on standard image classification tasks. Moreover, we continue fine-tuning LLaVA, an MLLM, and utilize EMT to assess performance throughout the fine-tuning. Interestingly, our results suggest that early-stage fine-tuning on an image dataset improves performance across other image datasets, by enhancing the alignment of text and visual features. However, as fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability, even when the image encoder remains frozen. Our results suggest that MLLMs have yet to demonstrate performance on par with their vision models on standard image classification tasks and the current MLLM fine-tuning procedure still has room for improvement.

Submitted 5 December, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

arXiv:2309.09449 [pdf]

Multi-Affiliated Authors Behave Differently across Fields and Host Country Preferences: A Comparison in G7 and BRICS

Authors: Sichao Tong, Liying Yang

Abstract: This paper studies authors simultaneously engaged in multiple affiliations, based on bibliometric data covered in the Web of Science for the 2017-2021 period. Based on the affiliation information in publication records, we propose a general classification of multiple affiliations as within-country or cross-country for analyzing authors' behavior in multiple affiliations and preferences of host countries across research fields. We find a decrease in publications led by international multi-affiliated authorship after 2020, and China has shown a falling trend after 2018. More G7 countries are active in fields like Social Sciences, Clinical and Life Sciences. China, India, and Russia are active in physical sciences-related fields. Countries prefer to affiliate with G7 countries, especially in Clinical and Life Sciences. These findings may provide more insights into the understanding of the behavior and productivity of multi-affiliated researchers in the current academic landscape.

Submitted 17 September, 2023; originally announced September 2023.

Showing 1–50 of 127 results for author: Tong, S