Search | arXiv e-print repository

DipLLM: Fine-Tuning LLM for Strategic Decision-making in Diplomacy

Authors: Kaixuan Xu, Jiajun Chai, Sicheng Li, Yuqian Fu, Yuanheng Zhu, Dongbin Zhao

Abstract: Diplomacy is a complex multiplayer game that requires both cooperation and competition, posing significant challenges for AI systems. Traditional methods rely on equilibrium search to generate extensive game data for training, which demands substantial computational resources. Large Language Models (LLMs) offer a promising alternative, leveraging pre-trained knowledge to achieve strong performance with relatively small-scale fine-tuning. However, applying LLMs to Diplomacy remains challenging due to the exponential growth of possible action combinations and the intricate strategic interactions among players. To address this challenge, we propose DipLLM, a fine-tuned LLM-based agent that learns equilibrium policies for Diplomacy. DipLLM employs an autoregressive factorization framework to simplify the complex task of multi-unit action assignment into a sequence of unit-level decisions. By defining an equilibrium policy within this framework as the learning objective, we fine-tune the model using only 1.5% of the data required by the state-of-the-art Cicero model, surpassing its performance. Our results demonstrate the potential of fine-tuned LLMs for tackling complex strategic decision-making in multiplayer games.
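
Illustrative note (not part of the arXiv record): the autoregressive factorization mentioned in the abstract can be read as a chain-rule decomposition, in which the probability of a joint order set is the product of per-unit probabilities, each conditioned on the orders already chosen. The sketch below is a minimal toy under that assumption; `toy_unit_policy`, its order names, and its probabilities are hypothetical stand-ins and are not taken from DipLLM.

```python
import math
from itertools import product


def joint_log_prob(unit_policy, state, orders):
    """Chain-rule log-probability of a joint action.

    unit_policy(state, previous_orders, unit_index) is a hypothetical callable
    returning a dict that maps each candidate order to its probability.
    """
    log_p = 0.0
    for i, order in enumerate(orders):
        # Each unit's distribution is conditioned on the orders chosen so far.
        probs = unit_policy(state, tuple(orders[:i]), i)
        log_p += math.log(probs[order])
    return log_p


def toy_unit_policy(state, previous_orders, unit_index):
    """Toy two-unit policy (assumed numbers, for illustration only)."""
    if unit_index == 0:
        return {"move": 0.5, "hold": 0.5}
    # The second unit is more likely to support if the first unit moved.
    if "move" in previous_orders:
        return {"support": 0.8, "hold": 0.2}
    return {"support": 0.2, "hold": 0.8}


# Probability of the joint action ("move", "support"): 0.5 * 0.8.
p = math.exp(joint_log_prob(toy_unit_policy, None, ["move", "support"]))

# The factorized probabilities still sum to one over all joint actions.
total = sum(
    math.exp(joint_log_prob(toy_unit_policy, None, list(joint)))
    for joint in product(["move", "hold"], ["support", "hold"])
)
```

The point of the decomposition is that each step chooses among one unit's orders only, instead of scoring every combination of orders at once.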

Submitted 11 June, 2025; originally announced June 2025.

Comments: Accepted to the 42nd International Conference on Machine Learning (ICML 2025)

arXiv:2506.05904

Proactive Assistant Dialogue Generation from Streaming Egocentric Videos

Authors: Yichi Zhang, Xin Luna Dong, Zhaojiang Lin, Andrea Madotto, Anuj Kumar, Babak Damavandi, Joyce Chai, Seungwhan Moon

Abstract: Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in \dataset, a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. Project page: https://pro-assist.github.io/

Submitted 6 June, 2025; originally announced June 2025.

arXiv:2506.02112

SAB3R: Semantic-Augmented Backbone in 3D Reconstruction

Authors: Xuweiyi Chen, Tian Xia, Sihan Xu, Jianing Yang, Joyce Chai, Zezhou Cheng

Abstract: We introduce a new task, Map and Locate, which unifies the traditionally distinct objectives of open-vocabulary segmentation - detecting and segmenting object instances based on natural language queries - and 3D reconstruction, the process of estimating a scene's 3D structure from visual inputs. Specifically, Map and Locate involves generating a point cloud from an unposed video and segmenting object instances based on open-vocabulary queries. This task serves as a critical step toward real-world embodied AI applications and introduces a practical task that bridges reconstruction, recognition and reorganization. To tackle this task, we introduce a simple yet effective baseline, which we denote as SAB3R. Our approach builds upon MASt3R, a recent breakthrough in 3D computer vision, and incorporates a lightweight distillation strategy. This method transfers dense, per-pixel semantic features from 2D vision backbones (e.g., CLIP and DINOv2) to enhance MASt3R's capabilities. Without introducing any auxiliary frozen networks, our model generates per-pixel semantic features and constructs cohesive point maps in a single forward pass. Compared to separately deploying MASt3R and CLIP, our unified model, SAB3R, achieves superior performance on the Map and Locate benchmark. Furthermore, we evaluate SAB3R on both 2D semantic segmentation and 3D tasks to comprehensively validate its effectiveness.

Submitted 3 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

Comments: 3D-LLM/VLA @ CVPR2025 | Project page: https://uva-computer-vision-lab.github.io/sab3r/

arXiv:2506.00439

RLAE: Reinforcement Learning-Assisted Ensemble for LLMs

Authors: Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Guojun Yin, Wei Lin, Qichao Zhang, Dongbin Zhao

Abstract: Ensembling large language models (LLMs) can effectively combine diverse strengths of different models, offering a promising approach to enhance performance across various tasks. However, existing methods typically rely on fixed weighting strategies that fail to adapt to the dynamic, context-dependent characteristics of LLM capabilities. In this work, we propose Reinforcement Learning-Assisted Ensemble for LLMs (RLAE), a novel framework that reformulates LLM ensemble through the lens of a Markov Decision Process (MDP). Our approach introduces an RL agent that dynamically adjusts ensemble weights by considering both input context and intermediate generation states, with the agent being trained using rewards that directly correspond to the quality of final outputs. We implement RLAE using both single-agent and multi-agent reinforcement learning algorithms ($\text{RLAE}_\text{PPO}$ and $\text{RLAE}_\text{MAPPO}$), demonstrating substantial improvements over conventional ensemble methods. Extensive evaluations on a diverse set of tasks show that RLAE outperforms existing approaches by up to $3.3\%$ accuracy points, offering a more effective framework for LLM ensembling. Furthermore, our method exhibits superior generalization capabilities across different tasks without the need for retraining, while simultaneously achieving lower time latency.
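
Illustrative note (not part of the arXiv record): the abstract contrasts fixed ensemble weights with weights that an agent adjusts during generation. The sketch below shows only the mixing step, under the assumption that the ensemble combines next-token probability distributions linearly; the abstract does not say whether RLAE mixes probabilities, logits, or something else. The function name `mix_distributions` and the numbers are hypothetical, and the RL agent that would supply the weights is not modeled.

```python
def mix_distributions(dists, weights):
    """Weighted mixture of next-token distributions.

    dists:   list of dicts mapping token -> probability, one per model.
    weights: list of non-negative floats, one per model (assumed to come
             from some external controller; here they are plain inputs).
    """
    total = sum(weights)
    norm = [w / total for w in weights]  # normalize so the mixture sums to 1
    vocab = set().union(*dists)          # union of all tokens seen by any model
    return {
        tok: sum(w * d.get(tok, 0.0) for w, d in zip(norm, dists))
        for tok in vocab
    }


model_a = {"yes": 0.9, "no": 0.1}
model_b = {"yes": 0.2, "no": 0.8}

# A fixed strategy would reuse one weight vector for every step; a
# context-dependent controller could emit a different vector per step.
step_1 = mix_distributions([model_a, model_b], [1.0, 1.0])
step_2 = mix_distributions([model_a, model_b], [1.0, 3.0])
```

With equal weights the mixture is the plain average of the two models; shifting weight toward `model_b` moves the mixture toward its preferred token.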

Submitted 31 May, 2025; originally announced June 2025.

arXiv:2505.23723

ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

Authors: Zexi Liu, Jingyi Chai, Xinyu Zhu, Shuo Tang, Rui Ye, Bo Zhang, Lei Bai, Siheng Chen

Abstract: The emergence of large language model (LLM)-based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, most existing approaches rely heavily on manual prompt engineering, failing to adapt and optimize based on diverse experimental experiences. Focusing on this, for the first time, we explore the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). To realize this, we propose a novel agentic ML training framework with three key components: (1) exploration-enriched fine-tuning, which enables LLM agents to generate diverse actions for enhanced RL exploration; (2) step-wise RL, which enables training on a single action step, accelerating experience collection and improving training efficiency; (3) an agentic ML-specific reward module, which unifies varied ML feedback signals into consistent rewards for RL optimization. Leveraging this framework, we train ML-Agent, driven by a 7B-sized Qwen-2.5 LLM for autonomous ML. Remarkably, despite being trained on merely 9 ML tasks, our 7B-sized ML-Agent outperforms the 671B-sized DeepSeek-R1 agent. Furthermore, it achieves continuous performance improvements and demonstrates exceptional cross-task generalization capabilities.

Submitted 29 May, 2025; originally announced May 2025.

arXiv:2505.19381

DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving

Authors: Anqing Jiang, Yu Gao, Zhigang Sun, Yiru Wang, Jijun Wang, Jinghao Chai, Qian Cao, Yuweng Heng, Hao Jiang, Yunda Dong, Zongzheng Zhang, Xianda Guo, Hao Sun, Hao Zhao

Abstract: Research interest in end-to-end autonomous driving has surged owing to its fully differentiable design integrating modular tasks, i.e., perception, prediction, and planning, which enables optimization in pursuit of the ultimate goal. Despite the great potential of the end-to-end paradigm, existing methods suffer in several aspects, including expensive BEV (bird's eye view) computation, action diversity, and sub-optimal decisions in complex real-world scenarios. To address these challenges, we propose a novel hybrid sparse-dense diffusion policy, empowered by a Vision-Language Model (VLM), called Diff-VLA. We explore the sparse diffusion representation for efficient multi-modal driving behavior. Moreover, we rethink the effectiveness of VLM driving decisions and improve the trajectory generation guidance through deep interaction across agent, map instances, and VLM output. Our method shows superior performance in the Autonomous Grand Challenge 2025, which contains challenging real and reactive synthetic scenarios. Our method achieves 45.0 PDMS.

Submitted 2 June, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

Comments: 4 pages

arXiv:2505.11326

Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models

Authors: Keunwoo Peter Yu, Joyce Chai

Abstract: Vision-language models (VLMs) have shown remarkable progress in offline tasks such as image captioning and video question answering. However, real-time interactive environments impose new demands on VLMs, requiring them to generate utterances that are not only semantically accurate but also precisely timed. We identify two core capabilities necessary for such settings -- perceptual updating and contingency awareness -- and propose a new benchmark task, Temporally-Grounded Language Generation (TGLG), to evaluate them. TGLG requires models to generate utterances in response to streaming video such that both content and timing align with dynamic visual input. To support this benchmark, we curate evaluation datasets from sports broadcasting and egocentric human interaction domains, and introduce a new metric, TRACE, to evaluate TGLG by jointly measuring semantic similarity and temporal alignment. Finally, we present Vision-Language Model with Time-Synchronized Interleaving (VLM-TSI), a model that interleaves visual and linguistic tokens in a time-synchronized manner, enabling real-time language generation without relying on turn-based assumptions. Experimental results show that VLM-TSI significantly outperforms a strong baseline, yet overall performance remains modest -- highlighting the difficulty of TGLG and motivating further research in real-time VLMs. Code and data are available at https://github.com/yukw777/tglg.

Submitted 16 May, 2025; originally announced May 2025.

Comments: 18 pages

arXiv:2505.08808

SparseMeXT Unlocking the Potential of Sparse Representations for HD Map Construction

Authors: Anqing Jiang, Jinhao Chai, Yu Gao, Yiru Wang, Yuwen Heng, Zhigang Sun, Hao Sun, Zezhong Zhao, Li Sun, Jian Zhou, Lijuan Zhu, Shugong Xu, Hao Zhao

Abstract: Recent advancements in high-definition (HD) map construction have demonstrated the effectiveness of dense representations, which heavily rely on computationally intensive bird's-eye view (BEV) features. While sparse representations offer a more efficient alternative by avoiding dense BEV processing, existing methods often lag behind due to the lack of tailored designs. These limitations have hindered the competitiveness of sparse representations in online HD map construction. In this work, we systematically revisit and enhance sparse representation techniques, identifying key architectural and algorithmic improvements that bridge the gap with, and ultimately surpass, dense approaches. We introduce a dedicated network architecture optimized for sparse map feature extraction, a sparse-dense segmentation auxiliary task to better leverage geometric and semantic cues, and a denoising module guided by physical priors to refine predictions. Through these enhancements, our method achieves state-of-the-art performance on the nuScenes dataset, significantly advancing HD map construction and centerline detection. Specifically, SparseMeXt-Tiny reaches a mean average precision (mAP) of 55.5% at 32 frames per second (fps), while SparseMeXt-Base attains 65.2% mAP. Scaling the backbone and decoder further, SparseMeXt-Large achieves an mAP of 68.9% at over 20 fps, establishing a new benchmark for sparse representations in HD map construction. These results underscore the untapped potential of sparse methods, challenging the conventional reliance on dense representations and redefining efficiency-performance trade-offs in the field.

Submitted 11 May, 2025; originally announced May 2025.

arXiv:2505.02462

Incentivizing Inclusive Contributions in Model Sharing Markets

Authors: Enpei Zhang, Jingyi Chai, Rui Ye, Yanfeng Wang, Siheng Chen

Abstract: While data plays a crucial role in training contemporary AI models, it is acknowledged that valuable public data will be exhausted in a few years, directing the world's attention towards the massive decentralized private data. However, the privacy-sensitive nature of raw data and the lack of an incentive mechanism prevent these valuable data from being fully exploited. Addressing these challenges, this paper proposes inclusive and incentivized personalized federated learning (iPFL), which incentivizes data holders with diverse purposes to collaboratively train personalized models without revealing raw data. iPFL constructs a model-sharing market by solving a graph-based training optimization and incorporates an incentive mechanism based on game theory principles. Theoretical analysis shows that iPFL adheres to two key incentive properties: individual rationality and truthfulness. Empirical studies on eleven AI tasks (e.g., large language models' instruction-following tasks) demonstrate that iPFL consistently achieves the highest economic utility, and better or comparable model performance compared to baseline methods. We anticipate that our iPFL can serve as a valuable technique for boosting future AI models on decentralized private data while making everyone satisfied.

Submitted 5 May, 2025; originally announced May 2025.

arXiv:2504.16060

Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Authors: Ziqiao Ma, Jing Ding, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai

Abstract: Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.

Submitted 30 April, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

Comments: Homepage: https://vlm-reg.github.io/

arXiv:2503.14350

VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation

Authors: Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, Mohit Bansal

Abstract: Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models struggle with multi-tasking. VEGGIE also excels in video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications like zero-shot multimodal instructional and in-context video editing.

Submitted 19 March, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

Comments: First three authors contributed equally. Project page: https://veggie-gen.github.io/

arXiv:2502.13311

Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors

Authors: Jian Wang, Yinpei Dai, Yichi Zhang, Ziqiao Ma, Wenjie Li, Joyce Chai

Abstract: Intelligent tutoring agents powered by large language models (LLMs) have been increasingly explored to deliver personalized knowledge in areas such as language learning and science education. However, their capabilities in guiding users to solve complex real-world tasks remain underexplored. To address this limitation, in this work, we focus on coding tutoring, a challenging problem that requires tutors to proactively guide students towards completing predefined coding tasks. We propose a novel agent workflow, Trace-and-Verify (TRAVER), which combines knowledge tracing to estimate a student's knowledge state and turn-by-turn verification to ensure effective guidance toward task completion. We introduce DICT, an automatic evaluation protocol that assesses tutor agents using controlled student simulation and code generation tests. Extensive experiments reveal the challenges of coding tutoring and demonstrate that TRAVER achieves a significantly higher success rate. Although we use code tutoring as an example in this paper, our approach can be extended beyond coding, providing valuable insights into advancing tutoring agents for human task learning.

Submitted 25 May, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

Comments: Accepted to Findings of ACL 2025

arXiv:2502.02800

The $CP$ violations and branching ratios for $B_c^+\to D_{(s)}^+π^+π^-(K^{+}K^{-})$ from interference of the vector mesons in Perturbative QCD

Authors: Kun Shuai Ye, Gang Lü, Na-Wang, Jian Chai, Xin-Heng Guo

Abstract: Within the framework of the perturbative QCD approach utilizing $K_T$ factorization, we have investigated the CP violations and branching ratios in the decay processes of $B_{c}^{+}\to D_{(s)}^{+}V(V\rightarrow π^{+}π^{-})$ and $B_{c}^{+}\to D_{(s)}^{+}V(V\rightarrow K^{+}K^{-})$, where V denotes three vector mesons $ρ^0$, $ω$, and $φ$. During the $V\to π^+π^-$ and $V\to K^+K^-$ decay processes, we incorporated the $ρ^{0}-ω-φ$ mixing mechanism to describe the amplitudes of these quasi-two-body decay processes. Within the interference regime of the three vector particles, we observed distinct changes in both CP violations and branching ratios. Furthermore, our study presents evidence for local CP violations and branching ratios that warrants further investigation through experiments.

Submitted 4 February, 2025; originally announced February 2025.

Comments: arXiv admin note: text overlap with arXiv:2309.15351

arXiv:2502.00534

Transition Transfer $Q$-Learning for Composite Markov Decision Processes

Authors: Jinhang Chai, Elynn Chen, Lin Yang

Abstract: To bridge the gap between empirical success and theoretical understanding in transfer reinforcement learning (RL), we study a principled approach with provable performance guarantees. We introduce a novel composite MDP framework where high-dimensional transition dynamics are modeled as the sum of a low-rank component representing shared structure and a sparse component capturing task-specific variations. This relaxes the common assumption of purely low-rank transition models, allowing for more realistic scenarios where tasks share core dynamics but maintain individual variations. We introduce UCB-TQL (Upper Confidence Bound Transfer Q-Learning), designed for transfer RL scenarios where multiple tasks share core linear MDP dynamics but diverge along sparse dimensions. When applying UCB-TQL to a target task after training on a source task with sufficient trajectories, we achieve a regret bound of $\tilde{O}(\sqrt{eH^5N})$ that scales independently of the ambient dimension. Here, $N$ represents the number of trajectories in the target task, while $e$ quantifies the sparse differences between tasks. This result demonstrates substantial improvement over single task RL by effectively leveraging their structural similarities. Our theoretical analysis provides rigorous guarantees for how UCB-TQL simultaneously exploits shared dynamics while adapting to task-specific variations.
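
Illustrative note (not part of the arXiv record): the composite structure described in the abstract can be written schematically as below. The symbols $P^{(k)}$, $L$, $S^{(k)}$, and $r$ are notation assumed here for illustration and may differ from the paper's; only $e$, $H$, and $N$ appear in the abstract, and reading $e$ as a bound on the number of nonzero entries is likewise an assumption.

```latex
% Schematic only; notation is assumed, not taken from the paper.
% Task k's transition model = shared low-rank part + task-specific sparse part.
\[
  P^{(k)} \;=\; \underbrace{L}_{\text{shared, } \operatorname{rank}(L) \le r}
  \;+\; \underbrace{S^{(k)}}_{\text{task-specific, } \|S^{(k)}\|_0 \le e}
\]
% Regret bound on the target task as stated in the abstract:
\[
  \mathrm{Regret}(N) \;=\; \tilde{O}\!\left(\sqrt{e\,H^{5}\,N}\right)
\]
```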

Submitted 1 February, 2025; originally announced February 2025.

arXiv:2501.13928

Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass

Authors: Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, Matt Feiszli

Abstract: Multi-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives. Current leading methods such as DUSt3R employ a fundamentally pairwise approach, processing images in pairs and necessitating costly global alignment procedures to reconstruct from multiple views. In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization to DUSt3R that achieves efficient and scalable 3D reconstruction by processing many views in parallel. Fast3R's Transformer-based architecture forwards N images in a single forward pass, bypassing the need for iterative alignment. Through extensive experiments on camera pose estimation and 3D reconstruction, Fast3R demonstrates state-of-the-art performance, with significant improvements in inference speed and reduced error accumulation. These results establish Fast3R as a robust alternative for multi-view applications, offering enhanced scalability without compromising reconstruction accuracy.

Submitted 19 March, 2025; v1 submitted 23 January, 2025; originally announced January 2025.

Comments: CVPR 2025. Project website: https://fast3r-3d.github.io/

arXiv:2501.08783

Form factors of light pseudoscalar mesons from the perturbative QCD approach

Authors: Jian Chai, Shan Cheng

Abstract: We study the electromagnetic and meson-photon transition form factors (TFF) of light pseudoscalar mesons from the perturbative QCD (pQCD) approach. To comprehensively account for both the longitudinal and transverse nonperturbative dynamics of hadronic constituents, we incorporate intrinsic transverse momentum distributions (iTMDs) alongside the conventional light-cone distribution amplitudes (LCDAs). The main motivations of this work are the disjointedness of electromagnetic form factors between the theoretical predictions and the experimental measurements, and the BaBar-Belle tension of pion-photon transition form factor in the large momentum transfers. Our calculation is carried out at the next-to-leading-order for the contributions from leading and subleading twist LCDAs, and leading order for the twist four contributions. Notably, this work presents the first systematic evaluation of higher-twist contributions to meson-photon TFFs. The key findings are: (a) iTMDs play a crucial role in describing form factor data, particularly in the small-to-intermediate momentum transfer region where they induce significant modifications to pQCD predictions. (b) The extracted transverse size parameters for valence quark states are found to be $β_π^2 = 0.51 \pm 0.04$ GeV$^{-2}$ and $β_K^2 = 0.30 \pm 0.05$ GeV$^{-2}$; the chiral mass of the pion meson $m_0^π$ at $1$ GeV is determined to be $1.84 \pm 0.07$ GeV. (c) The meson-photon TFFs are predominantly governed by leading-twist LCDAs. The iTMDs-enhanced pQCD results show better agreement with Belle's pion TFF data across intermediate and large momentum transfers and favor a small $η-η^\prime$ mixing angle. (d) Remarkably, the inclusion of iTMDs extends the applicability of pQCD calculations down to a few GeV$^2$ for all considered form factors, significantly improving the theory-data consistency.

Submitted 4 June, 2025; v1 submitted 15 January, 2025; originally announced January 2025.

Comments: 46 pages, 14 figures, 6 tables, figure 7 updated, version to appear in JHEP

arXiv:2501.04870 [pdf, other]

Deep Transfer $Q$-Learning for Offline Non-Stationary Reinforcement Learning

Authors: Jinhang Chai, Elynn Chen, Jianqing Fan

Abstract: In dynamic decision-making scenarios across business and healthcare, leveraging sample trajectories from diverse populations can significantly enhance reinforcement learning (RL) performance for specific target populations, especially when sample sizes are limited. While existing transfer learning methods primarily focus on linear regression settings, they lack direct applicability to reinforcement learning algorithms. This paper pioneers the study of transfer learning for dynamic decision scenarios modeled by non-stationary finite-horizon Markov decision processes, utilizing neural networks as powerful function approximators and backward inductive learning. We demonstrate that naive sample pooling strategies, effective in regression settings, fail in Markov decision processes. To address this challenge, we introduce a novel ``re-weighted targeting procedure'' to construct ``transferable RL samples'' and propose ``transfer deep $Q^*$-learning'', enabling neural network approximation with theoretical guarantees. We assume that the reward functions are transferable and deal with both situations in which the transition densities are transferable or nontransferable. Our analytical techniques for transfer learning in neural network approximation and transition density transfers have broader implications, extending to supervised transfer learning with neural networks and domain shift scenarios. Empirical experiments on both synthetic and real datasets corroborate the advantages of our method, showcasing its potential for improving decision-making through strategically constructing transferable RL samples in non-stationary reinforcement learning contexts.

Submitted 11 April, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

arXiv:2412.19252 [pdf, other]

Localized exploration in contextual dynamic pricing achieves dimension-free regret

Authors: Jinhang Chai, Yaqi Duan, Jianqing Fan, Kaizheng Wang

Abstract: We study the problem of contextual dynamic pricing with a linear demand model. We propose a novel localized exploration-then-commit (LetC) algorithm which starts with a pure exploration stage, followed by a refinement stage that explores near the learned optimal pricing policy, and finally enters a pure exploitation stage. The algorithm is shown to achieve a minimax optimal, dimension-free regret bound when the time horizon exceeds a polynomial of the covariate dimension. Furthermore, we provide a general theoretical framework that encompasses the entire time spectrum, demonstrating how to balance exploration and exploitation when the horizon is limited. The analysis is powered by a novel critical inequality that depicts the exploration-exploitation trade-off in dynamic pricing, mirroring its existing counterpart for the bias-variance trade-off in regularized regression. Our theoretical results are validated by extensive experiments on synthetic and real-world data.

Submitted 26 December, 2024; originally announced December 2024.

Comments: 60 pages, 9 figures

arXiv:2412.11927 [pdf, other]

Transparent and Coherent Procedural Mistake Detection

Authors: Shane Storks, Itamar Bar-Yossef, Yayuan Li, Zheyuan Zhang, Jason J. Corso, Joyce Chai

Abstract: Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that while VLMs struggle off-the-shelf, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods, though not without tradeoff. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.

Submitted 27 May, 2025; v1 submitted 16 December, 2024; originally announced December 2024.

arXiv:2412.05941 [pdf, other]

Shedding light on the intrinsic transversal momentum distributions of pion and kaon

Authors: Jian Chai, Shan Cheng

Abstract: We propose to introduce the intrinsic transversal momentum distribution functions (iTMDs), in conjunction with the light-cone distribution amplitudes (LCDAs), to elucidate the probability amplitude of encountering a meson state wherein the partons swiftly traverse along the longitudinal axis while gently oscillating in the transversal plane. The primary motivation stems from the oversight of soft transverse dynamics within the $k_T$ factorization formalism of an exclusive QCD process, which confines perturbative QCD (pQCD) predictions to scenarios involving large momentum transfers. We meticulously investigate the $π$ and $K$ electromagnetic form factors using the iTMDs-improved pQCD calculation at next-to-leading order. By analyzing data in the timelike physical regions, we obtain the transversal-size parameters $β_π^2 = 0.51 \pm 0.04$ GeV$^{-2}$ and $β_K^2 = 0.30 \pm 0.05$ GeV$^{-2}$. We then extract the chiral mass of pion to be $m_0^π(1 \, {\rm GeV}) = 1.84 \pm 0.07$ GeV and explain the precise measurements of kaon form factor in the perturbative timelike region. As a remarkable byproduct, we found that the incorporation of iTMDs improves the pQCD predictions for electromagnetic form factors, extending the applicable range to a few GeV$^2$. This improvement allows for direct comparison with existing measurements and lattice QCD evaluations.

Submitted 22 March, 2025; v1 submitted 8 December, 2024; originally announced December 2024.

Comments: 7 pages, 6 figures, 1 table. Matches the version accepted in Physical Review D (Letter)

arXiv:2412.01708 [pdf, other]

Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review

Authors: Rui Ye, Xianghe Pang, Jingyi Chai, Jiaao Chen, Zhenfei Yin, Zhen Xiang, Xiaowen Dong, Jing Shao, Siheng Chen

Abstract: Scholarly peer review is a cornerstone of scientific advancement, but the system is under strain due to increasing manuscript submissions and the labor-intensive nature of the process. Recent advancements in large language models (LLMs) have led to their integration into peer review, with promising results such as substantial overlaps between LLM- and human-generated reviews. However, the unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. In this study, we comprehensively analyze the vulnerabilities of LLM-generated reviews by focusing on manipulation and inherent flaws. Our experiments show that injecting covert deliberate content into manuscripts allows authors to explicitly manipulate LLM reviews, leading to inflated ratings and reduced alignment with human reviews. In a simulation, we find that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings. Implicit manipulation, where authors strategically highlight minor limitations in their papers, further demonstrates LLMs' susceptibility compared to human reviewers, with a 4.5 times higher consistency with disclosed limitations. Additionally, LLMs exhibit inherent flaws, such as potentially assigning higher ratings to incomplete papers compared to full papers and favoring well-known authors in a single-blind review process. These findings highlight the risks of over-reliance on LLMs in peer review, underscoring that we are not yet ready for widespread adoption and emphasizing the need for robust safeguards.

Submitted 2 December, 2024; originally announced December 2024.

Comments: 27 pages, 24 figures

arXiv:2411.13176 [pdf, other]

doi 10.1103/PhysRevB.111.115304

Spin-phase transition in an array of quantum rings controlled by cavity photons

Authors: Vidar Gudmundsson, Vram Mughnetsyan, Hsi-Sheng Goan, Jeng-Da Chai, Nzar Rauf Abdullah, Chi-Shung Tang, Valeriu Moldoveanu, Andrei Manolescu

Abstract: We model a spin-phase transition in a two-dimensional square array, or a lateral superlattice, of quantum rings in an external perpendicular homogeneous magnetic field. The electron system is placed in a circular cylindrical far-infrared photon cavity with a single circularly symmetric photon mode. Our numerical results reveal that the spin ordering of the two-dimensional electron gas in each quantum ring can be influenced or controlled by the electron-photon coupling strength and the energy of the photons. The Coulomb interaction between the electrons is described by a spin-density functional approach, but the para- and the diamagnetic electron-photon interactions are modeled via a configuration interaction formalism in a truncated many-body Fock-space, which is updated in each iteration step of the density functional approach. In the absence of external electromagnetic pulses this spin-phase transition is replicated in the orbital magnetization of the rings. The spin-phase transition can be suppressed by a strong electron-photon interaction. In addition, fluctuations in the spin configuration are found in dynamical calculations, where the system is excited by a time-dependent scheme specially fit for emphasizing the diamagnetic electron-photon interaction.

Submitted 20 November, 2024; originally announced November 2024.

Comments: RevTeX - pdfLaTeX, 11 pages with 9 included pdf figures

Journal ref: Phys. Rev. B 111, 115304 (2025)

arXiv:2411.08558 [pdf]

Effect of Top Al$_2$O$_3$ Interlayer Thickness on Memory Window and Reliability of FeFETs With TiN/Al$_2$O$_3$/Hf$_{0.5}$Zr$_{0.5}$O$_2$/SiO$_x$/Si (MIFIS) Gate Structure

Authors: Tao Hu, Xinpei Jia, Runhao Han, Jia Yang, Mingkai Bai, Saifei Dai, Zeqi Chen, Yajing Ding, Shuai Yang, Kai Han, Yanrong Wang, Jing Zhang, Yuanyuan Zhao, Xiaoyu Ke, Xiaoqing Sun, Junshuai Chai, Hao Xu, Xiaolei Wang, Wenwu Wang, Tianchun Ye

Abstract: We investigate the effect of top Al$_2$O$_3$ interlayer thickness on the memory window (MW) of Si channel ferroelectric field-effect transistors (Si-FeFETs) with TiN/Al$_2$O$_3$/Hf$_{0.5}$Zr$_{0.5}$O$_2$/SiO$_x$/Si (MIFIS) gate structure. We find that the MW first increases and then remains almost constant with the increasing thickness of the top Al$_2$O$_3$. The phenomenon is attributed to the lower electric field of the ferroelectric Hf$_{0.5}$Zr$_{0.5}$O$_2$ in the MIFIS structure with a thicker top Al$_2$O$_3$ after a program operation. Because of the lower electric field, the charges trapped at the top Al$_2$O$_3$/Hf$_{0.5}$Zr$_{0.5}$O$_2$ interface, which are injected from the metal gate, cannot be retained. Furthermore, we study the effect of the top Al$_2$O$_3$ interlayer thickness on the reliability (endurance characteristics and retention characteristics). We find that the MIFIS structure with a thicker top Al$_2$O$_3$ interlayer has poorer retention and endurance characteristics. Our work is helpful in deeply understanding the effect of top interlayer thickness on the MW and reliability of Si-FeFETs with MIFIS gate stacks.

Submitted 13 November, 2024; originally announced November 2024.

Comments: 7 pages, 12 figures

arXiv:2411.03968 [pdf, ps, other]

$B\to K\bar K(πη)h$ decays in the presence of isovector scalar resonances $a_0(980,1450)$

Authors: Si-Yang Wang, Zhi-Qing Zhang, Zhi-Jie Sun, Jian Chai, Peng Li

Abstract: Different from the previous treatment in a two-body framework, we introduce the dimeson distribution amplitudes (DAs) to describe the strong dynamics between the S-wave resonances $a_0(980, 1450)$ and the $K\bar K (πη)$ pair, where the Gegenbauer coefficient required is determined from the experimental data on the time-like form factors involved. The branching ratios and direct CP asymmetries of the decays $B \to a^{(\prime)}_0 h \to K\bar K(πη) h$, with $a_0=a_0(980)$, $a^{\prime}_0=a_0(1450)$ and $h$ referring to a pion or a kaon, are then calculated in the perturbative QCD (PQCD) approach. We find that the branching ratios of the corresponding quasi-two-body decays $B\to a^{(\prime)}_0 K$ obtained with the narrow width approximation are closer to those predicted in the QCD factorization (QCDF) approach compared to the previous PQCD calculations, regardless of whether a three-body or a two-body framework is assumed. Furthermore, all our predictions for these $B\to a^{(\prime)}_0 K$ decays are below the current experimental upper limits except for those of decays $B^0\to a^{(\prime)-}_0K^+$, which are (slightly) larger than the upper limits. Under the narrow width approximation, the branching ratios of the decays $B^+\to a^{(\prime)+}_0π^0$, $B^0\to a^{(\prime)+}_0π^-$ and $B^0\to a^{(\prime)0}_0π^0$ are comparable to or agree well with the previous PQCD and the QCDF calculations. For the decays $B^+\to a^{(\prime)0}_0π^+$ and $B^0\to a^{(\prime)-}_0π^+$, however, the branching ratios are predicted to be unexpectedly large; for example, the obtained branching ratio of the decay $B^+\to a^0_0π^+$ is even higher than the current experimental upper limit.

Submitted 6 November, 2024; originally announced November 2024.

Comments: 22 pages, 2 figures

arXiv:2411.03603 [pdf, other]

CPIG: Leveraging Consistency Policy with Intention Guidance for Multi-agent Exploration

Authors: Yuqian Fu, Yuanheng Zhu, Haoran Li, Zijie Zhao, Jiajun Chai, Dongbin Zhao

Abstract: Efficient exploration is crucial in cooperative multi-agent reinforcement learning (MARL), especially in sparse-reward settings. However, due to the reliance on the unimodal policy, existing methods are prone to falling into the local optima, hindering the effective exploration of better policies. Furthermore, in sparse-reward settings, each agent tends to receive a scarce reward, which poses significant challenges to inter-agent cooperation. This not only increases the difficulty of policy learning but also degrades the overall performance of multi-agent tasks. To address these issues, we propose a Consistency Policy with Intention Guidance (CPIG), with two primary components: (a) introducing a multimodal policy to enhance the agent's exploration capability, and (b) sharing the intention among agents to foster agent cooperation. For component (a), CPIG incorporates a Consistency model as the policy, leveraging its multimodal nature and stochastic characteristics to facilitate exploration. Regarding component (b), we introduce an Intention Learner to deduce the intention on the global state from each agent's local observation. This intention then serves as a guidance for the Consistency Policy, promoting cooperation among agents. The proposed method is evaluated in multi-agent particle environments (MPE) and multi-agent MuJoCo (MAMuJoCo). Empirical results demonstrate that our method not only achieves comparable performance to various baselines in dense-reward environments but also significantly enhances performance in sparse-reward settings, outperforming state-of-the-art (SOTA) algorithms by 20%.

Submitted 6 December, 2024; v1 submitted 5 November, 2024; originally announced November 2024.

arXiv:2410.24218 [pdf, other]

Teaching Embodied Reinforcement Learning Agents: Informativeness and Diversity of Language Use

Authors: Jiajun Xi, Yinong He, Jianing Yang, Yinpei Dai, Joyce Chai

Abstract: In real-world scenarios, it is desirable for embodied agents to have the ability to leverage human language to gain explicit or implicit knowledge for learning tasks. Despite recent progress, most previous approaches adopt simple low-level instructions as language inputs, which may not reflect natural human communication. It's not clear how to incorporate rich language use to facilitate task learning. To address this question, this paper studies different types of language inputs in facilitating reinforcement learning (RL) embodied agents. More specifically, we examine how different levels of language informativeness (i.e., feedback on past behaviors and future guidance) and diversity (i.e., variation of language expressions) impact agent learning and inference. Our empirical results based on four RL benchmarks demonstrate that agents trained with diverse and informative language feedback can achieve enhanced generalization and fast adaptation to new tasks. These findings highlight the pivotal role of language use in teaching embodied agents new tasks in an open world. Project website: https://github.com/sled-group/Teachable_RL

Submitted 31 October, 2024; originally announced October 2024.

Comments: EMNLP 2024 Main. Project website: https://github.com/sled-group/Teachable_RL

arXiv:2410.17385 [pdf, other]

Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities

Authors: Zheyuan Zhang, Fengyuan Hu, Jayjun Lee, Freda Shi, Parisa Kordjamshidi, Joyce Chai, Ziqiao Ma

Abstract: Spatial expressions in situated communication can be ambiguous, as their meanings vary depending on the frames of reference (FoR) adopted by speakers and listeners. While spatial language understanding and reasoning by vision-language models (VLMs) have gained increasing attention, potential ambiguities in these models are still under-explored. To address this issue, we present the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to systematically assess the spatial reasoning capabilities of VLMs. We evaluate nine state-of-the-art VLMs using COMFORT. Despite showing some alignment with English conventions in resolving ambiguities, our experiments reveal significant shortcomings of VLMs: notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or culture-specific conventions in cross-lingual tests, as English tends to dominate other languages. With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning.

Submitted 17 April, 2025; v1 submitted 22 October, 2024; originally announced October 2024.

Comments: Accepted to ICLR 2025 (Oral) | Project page: https://spatial-comfort.github.io/

arXiv:2410.05725 [pdf, other]

KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server

Authors: Wenhao Wang, Xiaoyu Liang, Rui Ye, Jingyi Chai, Siheng Chen, Yanfeng Wang

Abstract: The success of large language models (LLMs) has led many parties to fine-tune LLMs on their own private data. However, this practice raises privacy concerns due to the memorization of LLMs. Existing solutions, such as utilizing synthetic data for substitution, struggle to simultaneously improve performance and preserve privacy. They either rely on a local model for generation, resulting in a performance decline, or take advantage of APIs, directly exposing the data to API servers. To address this issue, we propose KnowledgeSG, a novel client-server framework which enhances synthetic data quality and improves model performance while ensuring privacy. We achieve this by learning local knowledge from the private data with differential privacy (DP) and distilling professional knowledge from the server. Additionally, inspired by federated learning, we transmit models rather than data between the client and server to prevent privacy leakage. Extensive experiments in medical and financial domains demonstrate the effectiveness of KnowledgeSG. Our code is now publicly available at https://github.com/wwh0411/KnowledgeSG.

Submitted 9 October, 2024; v1 submitted 8 October, 2024; originally announced October 2024.

Comments: EMNLP 2024 Main

arXiv:2409.14674 [pdf, other]

RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning

Authors: Yinpei Dai, Jayjun Lee, Nima Fazeli, Joyce Chai

Abstract: Developing robust and correctable visuomotor policies for robotic manipulation is challenging due to the lack of self-recovery mechanisms from failures and the limitations of simple language instructions in guiding robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations for training. We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor framework, which combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision-language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language-conditioned visuomotor policy as an actor to predict the next actions. Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer (RVT) on RLbench across various evaluation settings, including standard long-horizon tasks, dynamic goal-change tasks and zero-shot unseen tasks, achieving superior performance in both simulated and real world environments. Videos and code are available at: https://rich-language-failure-recovery.github.io.

Submitted 22 September, 2024; originally announced September 2024.

Comments: Project Website: https://rich-language-failure-recovery.github.io

arXiv:2409.12485 [pdf]

doi 10.1021/acsnano.4c08554

Liquid Metal Oxide-assisted Integration of High-k Dielectrics and Metal Contacts for Two-Dimensional Electronics

Authors: Dasari Venkatakrishnarao, Abhishek Mishra, Yaoju Tarn, Michel Bosman, Rainer Lee, Sarthak Das, Subhrajit Mukherjee, Teymour Talha-Dean, Yiyu Zhang, Siew Lang Teo, Jian Wei Chai, Fabio Bussolotti, Kuan Eng Johnson Goh, Chit Siong Lau

Abstract: Two-dimensional van der Waals semiconductors are promising for future nanoelectronics. However, integrating high-k gate dielectrics for device applications is challenging as the inert van der Waals material surfaces hinder uniform dielectric growth. Here, we report a liquid metal oxide-assisted approach to integrate ultrathin, high-k HfO$_2$ dielectric on 2D semiconductors with atomically smooth interfaces. Using this approach, we fabricated 2D WS$_2$ top-gated transistors with subthreshold swings down to 74.5 mV/dec, gate leakage current density below $10^{-6}$ A/cm$^2$, and negligible hysteresis. We further demonstrate a one-step van der Waals integration of contacts and dielectrics on graphene. This can offer a scalable approach toward integrating entire prefabricated device stack arrays with 2D materials. Our work provides a scalable solution to address the crucial dielectric engineering challenge for 2D semiconductors, paving the way for high-performance 2D electronics.

Submitted 19 September, 2024; originally announced September 2024.

Journal ref: ACS Nano, 2024

arXiv:2409.09936 [pdf, other]

doi 10.1103/PhysRevB.110.205301

The tuning of para- and diamagnetic cavity photon excitations in a square array of quantum dots in a magnetic field

Authors: Vidar Gudmundsson, Vram Mughnetsyan, Hsi-Sheng Goan, Jeng-Da Chai, Nzar Rauf Abdullah, Chi-Shung Tang, Valeriu Moldoveanu, Andrei Manolescu

Abstract: We employ a ``real-time'' excitation scheme to calculate the excitation spectra of a two-dimensional electron system in a square array of quantum dots placed in a circular cylindrical far-infrared photon cavity subjected to a perpendicular homogeneous external magnetic field. The Coulomb interaction of the electrons is handled via spin density functional theory and the para- and the diamagnetic parts of the electron-photon coupling are updated according to a configuration interaction method in each iteration of the density functional calculation. The results show that an excitation scheme built on using the symmetry of the lateral square superlattice of the dots and the cylindrical cavity produces both para- and diamagnetic resonance peaks with oscillator strengths that can be steered by the excitation pulse parameters. The excitation method breaks the conditions for the generalized Kohn theorem and allows for insight into the subband structure of the electron system and can be used both in and outside the linear response regime.

Submitted 15 September, 2024; originally announced September 2024.

Comments: RevTeX - pdfLaTeX, 14 pages with 15 included pdf and png figures

Journal ref: Physical Review B 110, 205301 (2024)

arXiv:2409.07136 [pdf, other]

Leveraging Unstructured Text Data for Federated Instruction Tuning of Large Language Models

Authors: Rui Ye, Rui Ge, Yuchi Fengting, Jingyi Chai, Yanfeng Wang, Siheng Chen

Abstract: Federated instruction tuning enables multiple clients to collaboratively fine-tune a shared large language model (LLM) that can follow humans' instructions without directly sharing raw data. However, existing literature impractically requires that all the clients readily hold instruction-tuning data (i.e., structured instruction-response pairs), which necessitates massive human annotations since clients' data is usually unstructured text instead. Addressing this, we propose a novel and flexible framework FedIT-U2S, which can automatically transform unstructured corpus into structured data for federated instruction tuning. FedIT-U2S consists of two key steps: (1) few-shot instruction-tuning data generation, where each unstructured data piece together with several examples is combined to prompt an LLM in generating an instruction-response pair. To further enhance the flexibility, a retrieval-based example selection technique is proposed, where the examples are automatically selected based on the relatedness between the client's data piece and example pool, bypassing the need of determining examples in advance. (2) A typical federated instruction tuning process based on the generated data. Overall, FedIT-U2S can be applied to diverse scenarios as long as the client holds valuable text corpus, broadening the application scope of federated instruction tuning. We conduct a series of experiments on three domains (medicine, knowledge, and math), showing that our proposed FedIT-U2S can consistently and significantly bring improvement over the base LLM.

Submitted 11 September, 2024; originally announced September 2024.

Comments: 11 pages, work in progress

arXiv:2409.05847 [pdf, other]

LSVOS Challenge Report: Large-scale Complex and Long Video Object Segmentation

Authors: Henghui Ding, Lingyi Hong, Chang Liu, Ning Xu, Linjie Yang, Yuchen Fan, Deshui Miao, Yameng Gu, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang, Jinming Chai, Qin Ma, Junpei Zhang, Licheng Jiao, Fang Liu, Xinyu Liu, Jing Zhang, Kexin Zhang, Xu Liu, LingLing Li, Hao Fang, Feiyu Pan, Xiankai Lu, et al. (8 additional authors not shown)

Abstract: Despite the promising performance of current video segmentation models on existing benchmarks, these models still struggle with complex scenes. In this paper, we introduce the 6th Large-scale Video Object Segmentation (LSVOS) challenge in conjunction with the ECCV 2024 workshop. This year's challenge includes two tasks: Video Object Segmentation (VOS) and Referring Video Object Segmentation (RVOS). This year, we replace the classic YouTube-VOS and YouTube-RVOS benchmarks with the latest datasets MOSE, LVOS, and MeViS to assess VOS under more challenging complex environments. This year's challenge attracted 129 registered teams from more than 20 institutes across over 8 countries. This report includes the challenge and dataset introduction, and the methods used by the top 7 teams in two tracks. More details can be found on our homepage https://lsvos.github.io/.

Submitted 9 September, 2024; originally announced September 2024.

Comments: ECCV 2024 LSVOS Challenge Report: https://lsvos.github.io/

arXiv:2409.02508 [pdf, other]

TLD: A Vehicle Tail Light signal Dataset and Benchmark

Authors: Jinhao Chai, Shiyi Mu, Shugong Xu

Abstract: Understanding other drivers' intentions is crucial for safe driving. The role of taillights in conveying these intentions is underemphasized in current autonomous driving systems. Accurately identifying taillight signals is essential for predicting vehicle behavior and preventing collisions. Open-source taillight datasets are scarce, often small and inconsistently annotated. To address this gap, we introduce a new large-scale taillight dataset called TLD. Sourced globally, our dataset covers diverse traffic scenarios. To our knowledge, TLD is the first dataset to separately annotate brake lights and turn signals in real driving scenarios. We collected 17.78 hours of driving videos from the internet. This dataset consists of 152k labeled image frames sampled at a rate of 2 Hz, along with 1.5 million unlabeled frames interspersed throughout. Additionally, we have developed a two-stage vehicle light detection model consisting of two primary modules: a vehicle detector and a taillight classifier. Initially, YOLOv10 and DeepSORT captured consecutive vehicle images over time. Subsequently, the two classifiers work simultaneously to determine the states of the brake lights and turn signals. A post-processing procedure is then used to eliminate noise caused by misidentifications and provide the taillight states of the vehicle within a given time frame. Our method shows exceptional performance on our dataset, establishing a benchmark for vehicle taillight detection. The dataset is available at https://huggingface.co/datasets/ChaiJohn/TLD/tree/main

Submitted 4 September, 2024; originally announced September 2024.

arXiv:2408.13582 [pdf, other]

CSS-Segment: 2nd Place Report of LSVOS Challenge VOS Track

Authors: Jinming Chai, Qin Ma, Junpei Zhang, Licheng Jiao, Fang Liu

Abstract: Video object segmentation is a challenging task that serves as the cornerstone of numerous downstream applications, including video editing and autonomous driving. In this technical report, we briefly introduce the solution of our team "yuanjie" for video object segmentation in the 6-th LSVOS Challenge VOS Track at ECCV 2024. We believe that our proposed CSS-Segment will perform better in videos of complex object motion and long-term presentation. In this report, we successfully validated the effectiveness of the CSS-Segment in video object segmentation. Finally, our method achieved a J&F score of 80.84 in and test phases, and ultimately ranked 2nd in the 6-th LSVOS Challenge VOS Track at ECCV 2024.

Submitted 24 August, 2024; originally announced August 2024.

arXiv:2407.10038 [pdf, ps, other]

Asai gamma factors over finite fields

Authors: Jingsong Chai

Abstract: In this note, we define and study Asai gamma factors over finite fields. We also prove some results about local Asai L-functions over p-adic fields for level zero representations.

Submitted 13 July, 2024; originally announced July 2024.

arXiv:2407.07035 [pdf, other]

Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models

Authors: Yue Zhang, Ziqiao Ma, Jialu Li, Yanyuan Qiao, Zun Wang, Joyce Chai, Qi Wu, Mohit Bansal, Parisa Kordjamshidi

Abstract: Vision-and-Language Navigation (VLN) has gained increasing attention over recent years and many approaches have emerged to advance their development. The remarkable achievements of foundation models have shaped the challenges and proposed methods for VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning, and emphasizes the current methods and future opportunities leveraging foundation models to address VLN challenges. We hope our in-depth discussions could provide valuable resources and insights: on one hand, to milestone the progress and explore opportunities and potential roles for foundation models in this field, and on the other, to organize different challenges and solutions in VLN to foundation model researchers.

Submitted 29 December, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

Comments: Authors contributed equally to this work, and supervisors contributed equal advising to this work; GitHub repository: https://github.com/zhangyuejoslin/VLN-Survey-with-Foundation-Models

arXiv:2407.06192 [pdf, other]

Multi-Object Hallucination in Vision-Language Models

Authors: Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai

Abstract: Large vision language models (LVLMs) often suffer from object hallucination, producing objects not present in the given images. While current benchmarks for object hallucination primarily concentrate on the presence of a single object class rather than individual entities, this work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent objects or become distracted) when tasked with focusing on multiple objects simultaneously. We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. With comprehensive empirical studies and analysis of potential factors leading to multi-object hallucination, we found that (1). LVLMs suffer more hallucinations when focusing on multiple objects compared to a single object. (2). The tested object class distribution affects hallucination behaviors, indicating that LVLMs may follow shortcuts and spurious correlations. (3). Hallucinatory behaviors are influenced by data-specific factors, salience and frequency, and model intrinsic behaviors. We hope to enable LVLMs to recognize and reason about multiple objects that often occur in realistic visual scenes, provide insights, and quantify our progress towards mitigating the issues.

Submitted 31 October, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

Comments: Accepted to NeurIPS 2024 | Project page: https://multi-object-hallucination.github.io/

arXiv:2406.17044 [pdf, other]

Fault-tolerant embedding of quantum circuits on hardware architectures via swap gates

Authors: Shao-Hen Chiew, Ezequiel Ignacio Rodriguez Chiacchio, Vishal Sharma, Jing Hao Chai, Hui Khoon Ng

Abstract: In near-term quantum computing devices, connectivity between qubits remains limited by architectural constraints. A computational circuit with given connectivity requirements necessary for multi-qubit gates has to be embedded within physical hardware with fixed connectivity. Long-distance gates have to be done by first routing the relevant qubits together. The simplest routing strategy involves the use of swap gates to swap the information carried by two unconnected qubits to connected ones. Ideal swap gates just permute the qubits; real swap gates, however, have the added possibilities of causing simultaneous errors on the qubits involved and spreading errors across the circuit. A general swap scheme thus changes the error-propagation properties of a circuit, including those necessary for fault-tolerant functioning of a circuit. Here, we present a simple strategy to design the swap scheme needed to embed an abstract circuit onto a physical hardware with constrained connectivity, in a manner that preserves the fault-tolerant properties of the abstract circuit. The embedded circuit will, of course, be noisier, compared to a native implementation of the abstract circuit, but we show in the examples of embedding surface codes on heavy-hexagonal and hexagonal lattices that the deterioration is not severe. This then offers a straightforward solution to implementing circuits with fault-tolerance properties on current hardware.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.15478 [pdf]

Impact of the Top SiO2 Interlayer Thickness on Memory Window of Si Channel FeFET with TiN/SiO2/Hf0.5Zr0.5O2/SiOx/Si (MIFIS) Gate Structure

Authors: Tao Hu, Xianzhou Shao, Mingkai Bai, Xinpei Jia, Saifei Dai, Xiaoqing Sun, Runhao Han, Jia Yang, Xiaoyu Ke, Fengbin Tian, Shuai Yang, Junshuai Chai, Hao Xu, Xiaolei Wang, Wenwu Wang, Tianchun Ye

Abstract: We study the impact of top SiO2 interlayer thickness on the memory window (MW) of Si channel ferroelectric field-effect transistor (FeFET) with TiN/SiO2/Hf0.5Zr0.5O2/SiOx/Si (MIFIS) gate structure. We find that the MW increases with the increasing thickness of the top SiO2 interlayer, and such an increase exhibits a two-stage linear dependence. The physical origin is the presence of the different interfacial charges trapped at the top SiO2/Hf0.5Zr0.5O2 interface. Moreover, we investigate the dependence of endurance characteristics on initial MW. We find that the endurance characteristic degrades with increasing the initial MW. By inserting a 3.4 nm SiO2 dielectric interlayer between the gate metal TiN and the ferroelectric Hf0.5Zr0.5O2, we achieve a MW of 6.3 V and retention over 10 years. Our work is helpful in the device design of FeFET.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 6 pages, 12 figures. arXiv admin note: substantial text overlap with arXiv:2404.15825

arXiv:2406.10630 [pdf, other]

Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models

Authors: Rui Ye, Jingyi Chai, Xiangrui Liu, Yaodong Yang, Yanfeng Wang, Siheng Chen

Abstract: Federated learning (FL) enables multiple parties to collaboratively fine-tune a large language model (LLM) without the need of direct data sharing. Ideally, by training on decentralized data that is aligned with human preferences and safety principles, federated instruction tuning can result in an LLM that could behave in a helpful and safe manner. In this paper, we for the first time reveal the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method. Specifically, the malicious clients could automatically generate attack data without involving manual efforts and attack the FedIT system by training their local LLMs on such attack data. Unfortunately, this proposed safety attack not only can compromise the safety alignment of LLM trained via FedIT, but also cannot be effectively defended against by many existing FL defense methods. Targeting this, we further propose a post-hoc defense method, which could rely on a fully automated pipeline: generation of defense data and further fine-tuning of the LLM. Extensive experiments show that our safety attack method can significantly compromise the LLM's safety alignment (e.g., reduce safety rate by 70%), which cannot be effectively defended by existing defense methods (at most 4% absolute improvement), while our safety defense method can significantly enhance the attacked LLM's safety alignment (at most 69% absolute improvement).

Submitted 15 June, 2024; originally announced June 2024.

Comments: 18 pages

arXiv:2406.09264 [pdf, other]

Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions

Authors: Hua Shen, Tiffany Knearem, Reshmi Ghosh, Kenan Alkiek, Kundan Krishna, Yachuan Liu, Ziqiao Ma, Savvas Petridis, Yi-Hao Peng, Li Qiwei, Sushrita Rakshit, Chenglei Si, Yutong Xie, Jeffrey P. Bigham, Frank Bentley, Joyce Chai, Zachary Lipton, Qiaozhu Mei, Rada Mihalcea, Michael Terry, Diyi Yang, Meredith Ringel Morris, Paul Resnick, David Jurgens

Abstract: Recent advancements in general-purpose AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment. However, the lack of clarified definitions and scopes of human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve this alignment. In particular, ML- and philosophy-oriented alignment research often views AI alignment as a static, unidirectional process (i.e., aiming to ensure that AI systems' objectives match humans) rather than an ongoing, mutual alignment problem. This perspective largely neglects the long-term interaction and dynamic changes of alignment. To understand these gaps, we introduce a systematic review of over 400 papers published between 2019 and January 2024, spanning multiple domains such as Human-Computer Interaction (HCI), Natural Language Processing (NLP), Machine Learning (ML). We characterize, define and scope human-AI alignment. From this, we present a conceptual framework of "Bidirectional Human-AI Alignment" to organize the literature from a human-centered perspective. This framework encompasses both 1) conventional studies of aligning AI to humans that ensures AI produces the intended outcomes determined by humans, and 2) a proposed concept of aligning humans to AI, which aims to help individuals and society adjust to AI advancements both cognitively and behaviorally. Additionally, we articulate the key findings derived from literature analysis, including literature gaps and trends, human values, and interaction techniques. To pave the way for future studies, we envision three key challenges and give recommendations for future research.

Submitted 10 August, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: proposing "bidirectional human-AI alignment" framework after a systematic review of over 400 alignment papers

arXiv:2406.05132 [pdf, other]

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

Authors: Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai

Abstract: The integration of language and 3D perception is crucial for embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is a lack of large-scale datasets with dense grounding between language and 3D scenes. We introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons of models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the importance of large-scale 3D-text datasets for embodied AI research. Our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with resources and insights to lead to more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io

Submitted 20 March, 2025; v1 submitted 7 June, 2024; originally announced June 2024.

Comments: CVPR 2025. Project website: https://3d-grand.github.io

arXiv:2406.04845 [pdf, other]

FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models

Authors: Rui Ye, Rui Ge, Xinyu Zhu, Jingyi Chai, Yaxin Du, Yang Liu, Yanfeng Wang, Siheng Chen

Abstract: Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM). Following this training paradigm, the community has put massive efforts from diverse aspects including framework, performance, and privacy. However, an unpleasant fact is that there are currently no realistic datasets and benchmarks for FedLLM and previous works all rely on artificially constructed datasets, failing to capture properties in real-world scenarios. Addressing this, we propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics, to offer a comprehensive testbed for the FedLLM community. FedLLM-Bench encompasses three datasets (e.g., user-annotated multilingual dataset) for federated instruction tuning and one dataset (e.g., user-annotated preference dataset) for federated preference alignment, whose scale of client number ranges from 38 to 747. Our datasets incorporate several representative diversities: language, quality, quantity, instruction, length, embedding, and preference, capturing properties in real-world scenarios. Based on FedLLM-Bench, we conduct experiments on all datasets to benchmark existing FL methods and provide empirical insights (e.g., multilingual collaboration). We believe that our FedLLM-Bench can benefit the FedLLM community by reducing required efforts, providing a practical testbed, and promoting fair comparisons. Code and datasets are available at https://github.com/rui-ye/FedLLM-Bench.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 22 pages

arXiv:2406.04640 [pdf, other]

LinkGPT: Teaching Large Language Models To Predict Missing Links

Authors: Zhongmou He, Jing Zhu, Shengyi Qian, Joyce Chai, Danai Koutra

Abstract: Large Language Models (LLMs) have shown promising results on various language and vision tasks. Recently, there has been growing interest in applying LLMs to graph-based tasks, particularly on Text-Attributed Graphs (TAGs). However, most studies have focused on node classification, while the use of LLMs for link prediction (LP) remains understudied. In this work, we propose a new task on LLMs, where the objective is to leverage LLMs to predict missing links between nodes in a graph. This task evaluates an LLM's ability to reason over structured data and infer new facts based on learned patterns. This new task poses two key challenges: (1) How to effectively integrate pairwise structural information into the LLMs, which is known to be crucial for LP performance, and (2) how to solve the computational bottleneck when teaching LLMs to perform LP. To address these challenges, we propose LinkGPT, the first end-to-end trained LLM for LP tasks. To effectively enhance the LLM's ability to understand the underlying structure, we design a two-stage instruction tuning approach where the first stage fine-tunes the pairwise encoder, projector, and node projector, and the second stage further fine-tunes the LLMs to predict links. To address the efficiency challenges at inference time, we introduce a retrieval-reranking scheme. Experiments show that LinkGPT can achieve state-of-the-art performance on real-world graphs as well as superior generalization in zero-shot and few-shot learning, surpassing existing benchmarks. At inference time, it can achieve $10\times$ speedup while maintaining high LP accuracy.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.03008 [pdf, other]

DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences

Authors: Yidong Huang, Jacob Sansom, Ziqiao Ma, Felix Gervits, Joyce Chai

Abstract: Recent advancements in foundation models (FMs) have unlocked new prospects in autonomous driving, yet the experimental settings of these studies are preliminary, over-simplified, and fail to capture the complexity of real-world driving scenarios in human environments. It remains under-explored whether FM agents can handle long-horizon navigation tasks with free-form dialogue and deal with unexpected situations caused by environmental dynamics or task changes. To explore the capabilities and boundaries of FMs faced with the challenges above, we introduce DriVLMe, a video-language-model-based agent to facilitate natural and effective communication between humans and autonomous vehicles that perceive the environment and navigate. We develop DriVLMe from both embodied experiences in a simulated environment and social experiences from real human dialogue. While DriVLMe demonstrates competitive performance in both open-loop benchmarks and closed-loop human studies, we reveal several limitations and challenges, including unacceptable inference time, imbalanced training data, limited visual understanding, challenges with multi-turn interactions, simplified language generation from robotic experiences, and difficulties in handling on-the-fly unexpected situations like environmental dynamics and task changes.

Submitted 15 October, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

Comments: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2405.18256 [pdf]

doi 10.1021/acsnano.4c00422

Electrical Control Grain Dimensionality with Multilevel Magnetic Anisotropy

Authors: Shengyao Li, Sabpreet Bhatti, Siew Lang Teo, Ming Lin, Xinyue Pan, Zherui Yang, Peng Song, Wanghao Tian, Xinyu He, Jianwei Chai, Xian Jun Loh, Qiang Zhu, S. N. Piramanayagam, Xiao Renshaw Wang

Abstract: In alignment with the increasing demand for larger storage capacity and longer data retention, electrical control of magnetic anisotropy has been a research focus in the realm of spintronics. Typically, magnetic anisotropy is determined by grain dimensionality, which is set during the fabrication of magnetic thin films. Despite the intrinsic correlation between magnetic anisotropy and grain dimensionality, there is a lack of experimental evidence for electrically controlling grain dimensionality, thereby impeding the efficiency of magnetic anisotropy modulation. Here, we demonstrate an electric field control of grain dimensionality and prove it as the active mechanism for tuning interfacial magnetism. The reduction in grain dimensionality is associated with a transition from ferromagnetic to superparamagnetic behavior. We achieve a non-volatile and reversible modulation of the coercivity in both the ferromagnetic and superparamagnetic regimes. Subsequent electrical and elemental analysis confirms the variation in grain dimensionality upon the application of gate voltages, revealing a transition from a multidomain to a single-domain state accompanied by a reduction in grain dimensionality. Furthermore, we exploit the influence of grain dimensionality on domain wall motion, extending its applicability to multilevel magnetic memory and synaptic devices. Our results provide a strategy for tuning interfacial magnetism through grain size engineering for advancements in high-performance spintronics.

Submitted 18 October, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.13828 [pdf, other]

Babysit A Language Model From Scratch: Interactive Language Learning by Trials and Demonstrations

Authors: Ziqiao Ma, Zekun Wang, Joyce Chai

Abstract: Humans are efficient language learners and inherently social creatures. Our language development is largely shaped by our social interactions, for example, the demonstration and feedback from caregivers. Contrary to human language learning, recent advancements in large language models have primarily adopted a non-interactive training paradigm, and refined pre-trained models through feedback afterward. In this work, we explore how corrective feedback from interactions influences neural language acquisition from scratch through systematically controlled experiments, assessing whether it contributes to word learning efficiency in language models. We introduce a trial-and-demonstration (TnD) learning framework that incorporates three distinct components: student trials, teacher demonstrations, and a reward conditioned on language competence at various developmental stages. Our experiments reveal that the TnD approach accelerates word acquisition for student models of equal and smaller numbers of parameters, and we highlight the significance of both trials and demonstrations. We further show that the teacher's choices of words influence students' word-specific learning efficiency, and a practice-makes-perfect effect is evident by a strong correlation between the frequency of words in trials and their respective learning curves. Our findings suggest that interactive language learning, with teacher demonstrations and active trials, can facilitate efficient word learning in language models.

Submitted 18 April, 2025; v1 submitted 22 May, 2024; originally announced May 2024.

Comments: NAACL 2025 (Main) & Workshop on Large Language Models and Cognition @ ICML 2024 (Oral)

arXiv:2405.09187 [pdf, ps, other]

doi 10.1103/PhysRevA.109.062808

Spin Symmetry in Thermally-Assisted-Occupation Density Functional Theory

Authors: Yu-Yang Wang, Jeng-Da Chai

Abstract: For electronic systems with multi-reference (MR) character, Kohn-Sham density functional theory (KS-DFT) with the conventional exchange-correlation (xc) energy functionals can lead to incorrect spin densities and related properties. For example, for H2 dissociation, the spin-restricted and spin-unrestricted solutions obtained with the same xc energy functional in KS-DFT can be distinctly different, yielding the unphysical spin-symmetry breaking effects in the spin-unrestricted solutions. Recently, thermally-assisted-occupation density functional theory (TAO-DFT) has been shown to resolve the aforementioned spin-symmetry breaking, when the fictitious temperature is properly chosen. In this work, a response theory based on TAO-DFT is developed to demonstrate that TAO-DFT with a sufficiently large fictitious temperature can always resolve the unphysical spin-symmetry breaking in MR systems. To further support this, TAO-DFT calculations with various fictitious temperatures are performed for the dissociation of H2, N2, He2, and Ne2 as well as the twisted ethylene.

Submitted 29 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

Comments: accepted for publication in Phys. Rev. A, 23 pages, 5 figures

Journal ref: Phys. Rev. A 109, 062808 (2024)

arXiv:2404.15825 [pdf]

Impact of Top SiO2 interlayer Thickness on Memory Window of Si Channel FeFET with TiN/SiO2/Hf0.5Zr0.5O2/SiOx/Si (MIFIS) Gate Structure

Authors: Tao Hu, Xianzhou Shao, Mingkai Bai, Xinpei Jia, Saifei Dai, Xiaoqing Sun, Runhao Han, Jia Yang, Xiaoyu Ke, Fengbin Tian, Shuai Yang, Junshuai Chai, Hao Xu, Xiaolei Wang, Wenwu Wang, Tianchun Ye

Abstract: We study the impact of top SiO2 interlayer thickness on memory window of Si channel FeFET with TiN/SiO2/Hf0.5Zr0.5O2/SiOx/Si (MIFIS) gate structure. The memory window increases with thicker top SiO2. We realize the memory window of 6.3 V for 3.4 nm top SiO2. Moreover, we find that the endurance characteristic degrades with increasing the initial memory window.

Submitted 24 April, 2024; originally announced April 2024.

Comments: 4 pages, 7 figures

Showing 1–50 of 371 results for author: Chai, J