-
LLM-ABM for Transportation: Assessing the Potential of LLM Agents in System Analysis
Authors:
Tianming Liu,
Jirong Yang,
Yafeng Yin
Abstract:
Agent-based modeling approaches represent the state of the art in modeling travel demand and transportation system dynamics and are valuable tools for transportation planning. However, established agent-based approaches in transportation rely on multi-hierarchical mathematical models to simulate travel behavior, which face theoretical and practical limitations. The advent of large language models (LLMs) provides a new opportunity to refine agent-based modeling in transportation. LLM agents, which have impressive reasoning and planning abilities, can serve as a proxy for human travelers and be integrated into the modeling framework. However, despite evidence of their behavioral soundness, no existing studies have assessed the impact and validity of LLM-agent-based simulations from a system perspective in transportation. This paper aims to address this issue by designing and integrating LLM agents with human-traveler-like characteristics into a simulation of a transportation system and assessing its performance based on existing benchmarks. Using the classical transportation setting of the morning commute, we find that the agents not only exhibit sound behavior but also produce system dynamics that align well with standard benchmarks. Our analysis is the first to verify the effectiveness and potential of LLM-agent-based modeling for transportation planning at the system level.
Submitted 25 March, 2025;
originally announced March 2025.
-
Segment then Splat: A Unified Approach for 3D Open-Vocabulary Segmentation based on Gaussian Splatting
Authors:
Yiren Lu,
Yunlai Zhou,
Yiran Qiao,
Chaoda Song,
Tuo Liang,
Jing Ma,
Yu Yin
Abstract:
Open-vocabulary querying in 3D space is crucial for enabling more intelligent perception in applications such as robotics, autonomous systems, and augmented reality. However, most existing methods rely on 2D pixel-level parsing, leading to multi-view inconsistencies and poor 3D object retrieval. Moreover, they are limited to static scenes and struggle with dynamic scenes due to the complexities of motion modeling. In this paper, we propose Segment then Splat, a 3D-aware open-vocabulary segmentation approach for both static and dynamic scenes based on Gaussian Splatting. Segment then Splat reverses the long-established approach of "segmentation after reconstruction" by dividing Gaussians into distinct object sets before reconstruction. Once the reconstruction is complete, the scene is naturally segmented into individual objects, achieving true 3D segmentation. This approach not only eliminates Gaussian-object misalignment issues in dynamic scenes but also accelerates the optimization process, as it eliminates the need for learning a separate language field. After optimization, a CLIP embedding is assigned to each object to enable open-vocabulary querying. Extensive experiments on various datasets demonstrate the effectiveness of our proposed method in both static and dynamic scenarios.
Submitted 28 March, 2025;
originally announced March 2025.
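As a rough illustration of the final querying step described in the Segment then Splat abstract, the sketch below ranks per-object embeddings against a text-query embedding by cosine similarity. The three-dimensional vectors and object names are hypothetical stand-ins, not real CLIP outputs, and the function names are assumptions rather than the authors' code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_objects(query_embedding, object_embeddings):
    """Return the name of the object whose embedding best matches the query."""
    return max(
        object_embeddings,
        key=lambda name: cosine(query_embedding, object_embeddings[name]),
    )


# Hypothetical per-object embeddings (one vector per segmented object).
OBJECTS = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}

# Hypothetical embedding of a text query that is closest to "chair".
best = query_objects([0.9, 0.1, 0.0], OBJECTS)
```

Because each object carries a single embedding in this sketch, a query reduces to one similarity computation per object rather than per pixel or per Gaussian.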
-
Debate-Driven Multi-Agent LLMs for Phishing Email Detection
Authors:
Ngoc Tuong Vy Nguyen,
Felix D Childress,
Yunting Yin
Abstract:
Phishing attacks remain a critical cybersecurity threat. Attackers constantly refine their methods, making phishing emails harder to detect. Traditional detection methods, including rule-based systems and supervised machine learning models, either rely on predefined patterns like blacklists, which can be bypassed with slight modifications, or require large datasets for training and can still generate false positives and false negatives. In this work, we propose a multi-agent large language model (LLM) prompting technique that simulates debates among agents to detect whether the content of an email is phishing. Our approach uses two LLM agents to present arguments for or against the classification, with a judge agent adjudicating the final verdict based on the quality of the reasoning provided. This debate mechanism enables the models to critically analyze contextual cues and deceptive patterns in text, which leads to improved classification accuracy. The proposed framework is evaluated on multiple phishing email datasets, and the results demonstrate that mixed-agent configurations consistently outperform homogeneous configurations. Results also show that the debate structure itself is sufficient to yield accurate decisions without extra prompting strategies.
Submitted 27 March, 2025;
originally announced March 2025.
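The debate structure described in the phishing-detection abstract (two arguing agents plus a judge) can be sketched as a simple control loop. This is a minimal illustration only: the agents here are keyword-based stubs standing in for LLM calls, and every name (`run_debate`, `stub_pro`, `stub_con`, `stub_judge`, the cue list) is hypothetical rather than taken from the paper.

```python
from typing import Callable, List

# An agent takes the email text and the transcript so far, and returns an argument.
Agent = Callable[[str, List[str]], str]

# Hypothetical cue list used only by the stub agent below.
SUSPICIOUS_CUES = ("verify your account", "urgent", "click here", "password")


def run_debate(email: str, pro: Agent, con: Agent, judge: Agent, rounds: int = 2) -> str:
    """Alternate pro/con arguments for a fixed number of rounds, then ask the judge."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("PRO: " + pro(email, transcript))
        transcript.append("CON: " + con(email, transcript))
    return judge(email, transcript)


def stub_pro(email: str, transcript: List[str]) -> str:
    """Stand-in for an LLM arguing that the email is phishing."""
    hits = [cue for cue in SUSPICIOUS_CUES if cue in email.lower()]
    return ("suspicious cues: " + ", ".join(hits)) if hits else "no strong cues found"


def stub_con(email: str, transcript: List[str]) -> str:
    """Stand-in for an LLM arguing that the email is legitimate."""
    return "the message could plausibly come from a known sender"


def stub_judge(email: str, transcript: List[str]) -> str:
    """Stand-in for a judge LLM; sides with PRO only if it cited concrete cues."""
    pro_lines = [line for line in transcript if line.startswith("PRO:")]
    cited = any("suspicious cues" in line for line in pro_lines)
    return "phishing" if cited else "legitimate"


verdict = run_debate(
    "URGENT: click here to verify your account", stub_pro, stub_con, stub_judge
)
```

In a real system each stub would be replaced by a prompt to a language model; the mixed-agent configurations mentioned in the abstract would correspond to passing different underlying models as `pro`, `con`, and `judge`.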
-
SWI: Speaking with Intent in Large Language Models
Authors:
Yuwei Yin,
EunJeong Hwang,
Giuseppe Carenini
Abstract:
Intent, typically clearly formulated and planned, functions as a cognitive framework for reasoning and problem-solving. This paper introduces the concept of Speaking with Intent (SWI) in large language models (LLMs), where the explicitly generated intent encapsulates the model's underlying intention and provides high-level planning to guide subsequent analysis and communication. By emulating deliberate and purposeful thoughts in the human mind, SWI is hypothesized to enhance the reasoning capabilities and generation quality of LLMs. Extensive experiments on mathematical reasoning benchmarks consistently demonstrate the superiority of Speaking with Intent over Baseline (i.e., generation without explicit intent). Moreover, SWI outperforms answer-trigger prompting methods Chain-of-Thought and Plan-and-Solve and maintains competitive performance with the strong method ARR (Analyzing, Retrieving, and Reasoning). Additionally, the effectiveness and generalizability of SWI are solidified on reasoning-intensive question answering (QA) and text summarization benchmarks, where SWI brings consistent improvement to the Baseline generation. In text summarization, SWI-generated summaries exhibit greater accuracy, conciseness, and factual correctness, with fewer hallucinations. Furthermore, human evaluations verify the coherence, effectiveness, and interpretability of the intent produced by SWI. This proof-of-concept study creates a novel avenue for enhancing LLMs' reasoning abilities with cognitive notions.
Submitted 27 March, 2025;
originally announced March 2025.
-
Dynamic Learning and Productivity for Data Analysts: A Bayesian Hidden Markov Model Perspective
Authors:
Yue Yin
Abstract:
Data analysts are essential in organizations, transforming raw data into insights that drive decision-making and strategy. This study explores how analysts' productivity evolves on a collaborative platform, focusing on two key learning activities: writing queries and viewing peer queries. While traditional research often assumes static models, where performance improves steadily with cumulative learning, such models fail to capture the dynamic nature of real-world learning. To address this, we propose a Hidden Markov Model (HMM) that tracks how analysts transition between distinct learning states based on their participation in these activities.
Using an industry dataset with 2,001 analysts and 79,797 queries, this study identifies three learning states: novice, intermediate, and advanced. Productivity increases as analysts advance to higher states, reflecting the cumulative benefits of learning. Writing queries benefits analysts across all states, with the largest gains observed for novices. Viewing peer queries supports novices but may hinder analysts in higher states due to cognitive overload or inefficiencies. Transitions between states are also uneven, with progression from intermediate to advanced being particularly challenging. This study advances understanding of the dynamic learning behavior of knowledge workers and offers practical implications for designing systems, optimizing training, enabling personalized learning, and fostering effective knowledge sharing.
Submitted 26 March, 2025;
originally announced March 2025.
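To make the three-state hidden Markov idea from the data-analyst abstract concrete, the sketch below runs a standard forward filter over a sequence of productivity observations. It is a simplification under stated assumptions: the transition matrix is fixed (the paper conditions transitions on learning activities and uses Bayesian estimation), observations are binary (0 = low productivity, 1 = high), and every probability is a hypothetical placeholder, not an estimate from the study.

```python
STATES = ["novice", "intermediate", "advanced"]

# Hypothetical initial state distribution.
INIT = [0.80, 0.15, 0.05]

# Hypothetical transition matrix: TRANS[i][j] = P(next state j | current state i).
TRANS = [
    [0.80, 0.20, 0.00],
    [0.05, 0.85, 0.10],
    [0.00, 0.05, 0.95],
]

# Hypothetical emission matrix: EMIT[s][o] = P(observation o | state s).
EMIT = [
    [0.8, 0.2],
    [0.5, 0.5],
    [0.2, 0.8],
]


def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def forward_filter(observations, init=INIT, trans=TRANS, emit=EMIT):
    """Return P(current state | all observations so far) via the forward algorithm."""
    if not observations:
        return list(init)
    n = len(init)
    # Condition the initial distribution on the first observation.
    belief = normalize([init[s] * emit[s][observations[0]] for s in range(n)])
    for obs in observations[1:]:
        # Predict: push the belief through the transition matrix.
        predicted = [sum(belief[i] * trans[i][j] for i in range(n)) for j in range(n)]
        # Update: weight each state by how well it explains the observation.
        belief = normalize([predicted[j] * emit[j][obs] for j in range(n)])
    return belief


# Belief over learning states after ten consecutive high-productivity periods.
belief = forward_filter([1] * 10)
```

With these placeholder numbers, a run of high-productivity observations shifts probability mass away from the novice state and toward the advanced state, which mirrors the qualitative pattern the abstract reports.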
-
LOCAL: A Graph-Based Active Learning Approach for Stability Analysis of DAC@NG Catalysts
Authors:
Yue Yin,
Jiangshan He,
Hai Xiao
Abstract:
Dual atomic catalysts supported by nitrogen-doped graphene (DAC@NG) offer significant potential in catalytic applications by overcoming intrinsic limitations associated with single atomic catalysts. However, accurately determining their stability and atomic-scale configurations remains computationally challenging due to extensive structural variability. In this study, we present the LOCalization and Active Learning (LOCAL) framework, an innovative, scalable approach employing two graph convolutional network (GCN) models (POS2COHP and Graph2E) to predict stability energies directly from initial DAC@NG structures. Leveraging an extensive dataset of 611,648 DAC@NG structures, encompassing 38 metal elements, six distinct graphene quadra-vacancy patterns, and diverse carbon/nitrogen coordination environments, LOCAL achieved a remarkable validation mean absolute error of just 0.145 eV. Utilizing this framework, we systematically analyzed stability trends across various metal pairs, successfully generating phase diagrams for experimentally validated bimetallic systems (Co-Ni, Fe-Ni, Fe-Mn, and Ag-Ni). These results underscore LOCAL's capability for rapidly evaluating structural stability, significantly accelerating the discovery and optimization of high-performance catalysts. The developed dataset and LOCAL framework are publicly available, offering a valuable resource for future catalyst design and broader exploration of catalytic materials.
Submitted 25 March, 2025;
originally announced March 2025.
-
Oxidation States in Solids from Data-Driven Paradigms
Authors:
Yue Yin,
Hai Xiao
Abstract:
The oxidation state (OS) is an essential chemical concept that embodies chemical intuition but cannot be computed with well-defined physical laws. We establish a data-driven paradigm, with its implementation as Tsinghua Oxidation States in Solids (TOSS), to explicitly compute the OSs in crystal structures as the emergent properties from large-sized datasets based on Bayesian maximum a posteriori probability (MAP). TOSS employs two looping structures over the large-sized dataset of crystal structures to obtain an emergent library of distance distributions as the foundation for chemically intuitive understanding and then determines the OSs by minimizing a loss function for each structure based on MAP and distance distributions in the whole dataset. The application of TOSS to a dataset of more than 1,000,000 crystal structures delivers a superior success rate, and using the resulting OSs as the dataset, we further train a data-driven alternative to TOSS based on graph convolutional networks. We expect TOSS and the ML-model-based alternative to find a wide spectrum of applications, and this work also demonstrates an encouraging example for the data-driven paradigms to explicitly compute the chemical intuition for tackling complex problems in chemistry.
Submitted 25 March, 2025;
originally announced March 2025.
-
MuMA: 3D PBR Texturing via Multi-Channel Multi-View Generation and Agentic Post-Processing
Authors:
Lingting Zhu,
Jingrui Ye,
Runze Zhang,
Zeyu Hu,
Yingda Yin,
Lanjiong Li,
Jinnan Chen,
Shengju Qian,
Xin Wang,
Qingmin Liao,
Lequan Yu
Abstract:
Current methods for 3D generation still fall short in physically based rendering (PBR) texturing, primarily due to limited data and challenges in modeling multi-channel materials. In this work, we propose MuMA, a method for 3D PBR texturing through Multi-channel Multi-view generation and Agentic post-processing. Our approach features two key innovations: 1) We opt to model shaded and albedo appearance channels, where the shaded channel enables the integration of intrinsic decomposition modules for material properties. 2) Leveraging multimodal large language models, we emulate artists' techniques for material assessment and selection. Experiments demonstrate that MuMA achieves superior results in visual quality and material fidelity compared to existing methods.
Submitted 24 March, 2025;
originally announced March 2025.
-
Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning
Authors:
Zhe Hu,
Jing Li,
Zhongzhu Pu,
Hou Pong Chan,
Yu Yin
Abstract:
Vision Language Models (VLMs) have exhibited immense potential for embodied AI, yet they often lack the sophisticated situational reasoning required for complex decision-making. This paper shows that VLMs can achieve surprisingly strong decision-making performance when visual scenes are represented merely as text-only descriptions, suggesting foundational reasoning can be effectively learned from language. Motivated by this insight, we propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making. Praxis-VLM employs the GRPO algorithm on textual scenarios to instill robust reasoning capabilities, where models learn to evaluate actions and their consequences. These reasoning skills, acquired purely from text, successfully transfer to multimodal inference with visual inputs, significantly reducing reliance on scarce paired image-text training data. Experiments across diverse decision-making benchmarks demonstrate that Praxis-VLM substantially outperforms standard supervised fine-tuning, exhibiting superior performance and generalizability. Further analysis confirms that our models engage in explicit and effective reasoning, underpinning their enhanced performance and adaptability.
Submitted 22 May, 2025; v1 submitted 21 March, 2025;
originally announced March 2025.
-
Thermal resonance-enhanced transparency in room temperature Rydberg gases
Authors:
Jinlian Hu,
Yuechun Jiao,
Yuwen Yin,
Cheng Lu,
Jingxu Bai,
Suotang Jia,
Weibin Li,
Zhengyang Bai,
Jianming Zhao
Abstract:
We report enhanced optical transmission in the coherent, off-resonant excitation of Rydberg atom gases at room temperature via a two-photon process. Here thermal resonance-enhanced transparency (TRET) is induced when the detuning of the two lasers is adjusted to compensate for the atomic thermal-motion-induced energy shifts, i.e., single- and two-photon Doppler shifts. We show that the atomic velocity is mapped into the transmission of the probe fields, which can be altered by independently and selectively exciting different velocity groups through sweeping the detuning. The maximal transmission in TRET is about 8 times higher than that under electromagnetically induced transparency (EIT). Utilizing the TRET effect, we enhance the sensitivity of a Rydberg microwave receiver to 28.7 nV cm$^{-1}$ Hz$^{-1/2}$, an improvement by a factor of 2.1 over the EIT case. When atoms of separate velocity groups are excited simultaneously by multiple sets of detuned lasers, the receiver sensitivity further increases in linear proportion to the number of velocity groups. Our study paves a way to exploit light-matter interaction via TRET, and contributes to current efforts in developing quantum sensing, primary gas thermometry, and wireless communication with room-temperature atomic gases.
Submitted 20 March, 2025;
originally announced March 2025.
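The Doppler compensation mentioned in the TRET abstract can be written out for a generic two-photon ladder scheme. This is a textbook-style sketch under assumptions not stated in the abstract: counter-propagating probe and coupling beams with wavenumbers $k_p$ and $k_c$, lab-frame detunings $\Delta_p$ and $\Delta_c$, and an atom moving with velocity $v$ along the probe direction. The symbols are illustrative and may differ from the authors' notation.

```latex
% Effective detunings seen by an atom of velocity v
% (assumed counter-propagating probe and coupling beams).
\begin{align}
  \Delta_p^{\mathrm{eff}}(v) &= \Delta_p - k_p v
    && \text{(single-photon Doppler shift)} \\
  \delta^{\mathrm{eff}}(v) &= \Delta_p + \Delta_c - (k_p - k_c)\, v
    && \text{(two-photon Doppler shift)}
\end{align}
% Both vanish for the velocity group v when
\begin{equation}
  \Delta_p = k_p v, \qquad \Delta_c = -k_c v .
\end{equation}
```

Under these assumptions, sweeping the pair of detunings along this line selects one velocity group at a time, which is one way to read the abstract's statement that the atomic velocity is mapped into the probe transmission.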
-
BARD-GS: Blur-Aware Reconstruction of Dynamic Scenes via Gaussian Splatting
Authors:
Yiren Lu,
Yunlai Zhou,
Disheng Liu,
Tuo Liang,
Yu Yin
Abstract:
3D Gaussian Splatting (3DGS) has shown remarkable potential for static scene reconstruction, and recent advancements have extended its application to dynamic scenes. However, the quality of reconstructions depends heavily on high-quality input images and precise camera poses, which are nontrivial to obtain in real-world scenarios. Capturing dynamic scenes with handheld monocular cameras, for instance, typically involves simultaneous movement of both the camera and objects within a single exposure. This combined motion frequently results in image blur that existing methods cannot adequately handle. To address these challenges, we introduce BARD-GS, a novel approach for robust dynamic scene reconstruction that effectively handles blurry inputs and imprecise camera poses. Our method comprises two main components: 1) camera motion deblurring and 2) object motion deblurring. By explicitly decomposing motion blur into camera motion blur and object motion blur and modeling them separately, we achieve significantly improved rendering results in dynamic regions. In addition, we collect a real-world motion blur dataset of dynamic scenes to evaluate our approach. Extensive experiments demonstrate that BARD-GS effectively reconstructs high-quality dynamic scenes under realistic conditions, significantly outperforming existing methods.
Submitted 20 March, 2025;
originally announced March 2025.
-
$3d$ flat bands and coupled $4f$ moments in the kagome-honeycomb permanent magnet Sm$_{2}$Co$_{17}$
Authors:
Hao Zheng,
Zhiguang Xiao,
Ze Pan,
Guowei Yang,
Yonghao Liu,
Jianzhou Bian,
Yi Wu,
Teng Hua,
Jiawen Zhang,
Jiayi Lu,
Jiong Li,
Tulai Sun,
Yu Song,
Ruihua He,
J. Larrea Jiménez,
Guanghan Cao,
Huiqiu Yuan,
Yuanfeng Xu,
Yi Yin,
Ming Shi,
Chao Cao,
Yang Liu
Abstract:
Rare earth permanent magnets (REPMs) with both localized moments and itinerant conduction bands are not only important for fundamental research but also have significant technological applications. In particular, Sm$_{2}$Co$_{17}$ is a prototypical high-temperature REPM, where the Co atoms form a kagome-honeycomb stacked lattice. Here we report synthesis of epitaxial Sm$_{2}$Co$_{17}$ films using molecular beam epitaxy and measurements of their momentum-resolved electronic structure from in situ angle-resolved photoemission spectroscopy. Our results unveil two flat bands from Co $3d$ orbitals near the Fermi level ($E_F$), one at $\sim -300$ meV and another right at $E_F$, which arise from orbital-selective destructive interference and strong electron correlations, respectively. In addition, our results reveal that Sm $4f$ states are far away from $E_F$ (hence mostly localized) and exhibit an anomalous temperature dependence, caused by the $3d$-$4f$ magnetic coupling. Our findings provide direct spectroscopic insights to understand the strong uniaxial ferromagnetism in Sm$_{2}$Co$_{17}$ (and REPMs alike). Our work also opens avenues to explore flat-band physics near $E_F$ and emergent phenomena in correlated kagome-honeycomb lattices.
Submitted 19 May, 2025; v1 submitted 17 March, 2025;
originally announced March 2025.
-
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
Authors:
Haoyang Huang,
Guoqing Ma,
Nan Duan,
Xing Chen,
Changyi Wan,
Ranchen Ming,
Tianyu Wang,
Bo Wang,
Zhiying Lu,
Aojie Li,
Xianfang Zeng,
Xinhao Zhang,
Gang Yu,
Yuhe Yin,
Qiling Wu,
Wen Sun,
Kang An,
Xin Han,
Deshan Sun,
Wei Ji,
Bizhu Huang,
Brian Li,
Chenfei Wu,
Guanzhe Huang,
Huixin Xiong
, et al. (29 additional authors not shown)
Abstract:
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.
Submitted 14 March, 2025;
originally announced March 2025.
-
PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
Authors:
Runze He,
Bo Cheng,
Yuhang Ma,
Qingxiang Jia,
Shanyuan Liu,
Ao Ma,
Xiaoyu Wu,
Liebucha Wu,
Dawei Leng,
Yuhui Yin
Abstract:
In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and layout-to-image as two separate models, PlanGen jointly models the two tasks into one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multitask training related to layout, including layout planning, layout-to-image generation, image layout understanding, etc. In addition, PlanGen can be seamlessly expanded to layout-guided image manipulation thanks to the well-designed modeling, with a teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of our PlanGen in multiple layout-related tasks, showing its great potential. Code is available at: https://360cvgroup.github.io/PlanGen.
Submitted 30 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Spin density matrix for neutral $ρ$ mesons in a pion gas in linear response theory
Authors:
Yi-Liang Yin,
Wen-Bo Dong,
Cong Yi,
Qun Wang
Abstract:
We calculate the spin density matrix for neutral $ρ$ mesons from the spectral function and thermal shear tensor using the Kubo formula in linear response theory, which contributes to the $γ$ correlator for the chiral magnetic effect (CME) search. We derive the spectral function of neutral $ρ$ mesons with $ρππ$ and $ρρππ$ interactions using the Dyson-Schwinger equation. The thermal shear tensor contribution is obtained from the Kubo formula in linear response theory. We numerically calculate $ρ_{00}-1/3$ and $\mathrm{Re}ρ_{-1,1}$ using the simulation results for the thermal shear tensor from the hydrodynamical model, which are of the order $10^{-3}\sim10^{-2}$.
Submitted 12 March, 2025;
originally announced March 2025.
-
NAMI: Efficient Image Generation via Progressive Rectified Flow Transformers
Authors:
Yuhang Ma,
Bo Cheng,
Shanyuan Liu,
Ao Ma,
Xiaoyu Wu,
Liebucha Wu,
Dawei Leng,
Yuhui Yin
Abstract:
Flow-based transformer models for image generation have achieved state-of-the-art performance with larger model parameters, but their inference deployment cost remains high. To enhance inference performance while maintaining generation quality, we propose progressive rectified flow transformers. We divide the rectified flow into different stages according to resolution, using fewer transformer layers at the low-resolution stages to generate image layouts and concept contours, and progressively adding more layers as the resolution increases. Experiments demonstrate that our approach achieves fast convergence and reduces inference time while ensuring generation quality. The main contributions of this paper are summarized as follows: (1) We introduce progressive rectified flow transformers that enable multi-resolution training, accelerating model convergence; (2) NAMI leverages piecewise flow and spatial cascading of Diffusion Transformer (DiT) to rapidly generate images, reducing inference time by 40% to generate a 1024 resolution image; (3) We propose NAMI-1K benchmark to evaluate human preference performance, aiming to mitigate distributional bias and prevent data leakage from open-source benchmarks. The results show that our model is competitive with state-of-the-art models.
Submitted 12 March, 2025;
originally announced March 2025.
-
U-StyDiT: Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers
Authors:
Zhanjie Zhang,
Ao Ma,
Ke Cao,
Jing Wang,
Shanyuan Liu,
Yuhang Ma,
Bo Cheng,
Dawei Leng,
Yuhui Yin
Abstract:
Ultra-high quality artistic style transfer refers to repainting an ultra-high quality content image using the style information learned from the style image. Existing artistic style transfer methods can be categorized into style reconstruction-based and content-style disentanglement-based style transfer approaches. Although these methods can generate some artistic stylized images, they still exhibit obvious artifacts and disharmonious patterns, which hinder their ability to produce ultra-high quality artistic stylized images. To address these issues, we propose a novel artistic image style transfer method, U-StyDiT, which is built on transformer-based diffusion (DiT) and learns content-style disentanglement, generating ultra-high quality artistic stylized images. Specifically, we first design a Multi-view Style Modulator (MSM) to learn style information from a style image from local and global perspectives, conditioning U-StyDiT to generate stylized images with the learned style information. Then, we introduce a StyDiT Block to learn content and style conditions simultaneously from a style image. Additionally, we propose an ultra-high quality artistic image dataset, Aes4M, comprising 10 categories, each containing 400,000 style images. This dataset effectively solves the problem that the existing style transfer methods cannot produce high-quality artistic stylized images due to the size of the dataset and the quality of the images in the dataset. Finally, the extensive qualitative and quantitative experiments validate that our U-StyDiT can create higher quality stylized images compared to state-of-the-art artistic style transfer methods. To our knowledge, our proposed method is the first to address the generation of ultra-high quality stylized images using transformer-based diffusion.
Submitted 11 March, 2025;
originally announced March 2025.
-
WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation
Authors:
Jing Wang,
Ao Ma,
Ke Cao,
Jun Zheng,
Zhanjie Zhang,
Jiasong Feng,
Shanyuan Liu,
Yuhang Ma,
Bo Cheng,
Dawei Leng,
Yuhui Yin,
Xiaodan Liang
Abstract:
Recent rapid advancements in text-to-video (T2V) generation, such as Sora and Kling, have shown great potential for building world simulators. However, current T2V models struggle to grasp abstract physical principles and generate videos that adhere to physical laws. This challenge arises primarily from a lack of clear guidance on physical information due to a significant gap between abstract physical principles and generation models. To this end, we introduce the World Simulator Assistant (WISA), an effective framework for decomposing and incorporating physical principles into T2V models. Specifically, WISA decomposes physical principles into textual physical descriptions, qualitative physical categories, and quantitative physical properties. To effectively embed these physical attributes into the generation process, WISA incorporates several key designs, including Mixture-of-Physical-Experts Attention (MoPA) and a Physical Classifier, enhancing the model's physics awareness. Furthermore, most existing datasets feature videos where physical phenomena are either weakly represented or entangled with multiple co-occurring processes, limiting their suitability as dedicated resources for learning explicit physical principles. We propose a novel video dataset, WISA-32K, collected based on qualitative physical categories. It consists of 32,000 videos, representing 17 physical laws across three domains of physics: dynamics, thermodynamics, and optics. Experimental results demonstrate that WISA can effectively enhance the compatibility of T2V models with real-world physical laws, achieving a considerable improvement on the VideoPhy benchmark. Visual demonstrations of WISA and WISA-32K are available at https://360cvgroup.github.io/WISA/.
Submitted 11 March, 2025;
originally announced March 2025.
-
MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration
Authors:
Jinguang Wang,
Jingyu Wang,
Haifeng Sun,
Tingting Yang,
Zirui Zhuang,
Wanyi Ning,
Yuexi Yin,
Qi Qi,
Jianxin Liao
Abstract:
Quantization has been widely used to compress and accelerate inference of large language models (LLMs). Existing methods focus on per-token dynamic calibration to ensure both inference acceleration and model accuracy under 4-bit quantization. However, in autoregressive generation of long sequences, the overhead of repeated dynamic quantization and dequantization steps becomes considerable. In this work, we propose MergeQuant, an accurate and efficient per-channel static quantization framework. MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method, thereby eliminating the quantization overheads before and after matrix multiplication. Furthermore, in view of the significant differences between channel ranges, we propose dimensional reconstruction and adaptive clipping to address the non-uniformity of quantization scale factors and redistribute the channel variations to the subsequent modules to balance the parameter distribution under QSM. Within the static quantization setting of W4A4, MergeQuant reduces the accuracy gap on zero-shot tasks relative to the FP16 baseline to 1.3 points on the Llama-2-70B model. On the Llama-2-7B model, MergeQuant achieves up to 1.77x speedup in decoding and up to 2.06x speedup end-to-end compared to the FP16 baseline.
Submitted 6 March, 2025;
originally announced March 2025.
-
Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?
Authors:
Kun Xiang,
Zhili Liu,
Zihao Jiang,
Yunshuang Nie,
Kaixin Cai,
Yiyang Yin,
Runhui Huang,
Haoxiang Fan,
Hanhui Li,
Weiran Huang,
Yihan Zeng,
Yu-Jie Yuan,
Jianhua Han,
Lanqing Hong,
Hang Xu,
Xiaodan Liang
Abstract:
In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of "slow thinking" into multimodal large language models (MLLMs). Our core idea is that different levels of reasoning abilities can be combined dynamically to tackle questions of different complexity. To this end, we propose a paradigm of Self-structured Chain of Thought (SCoT), which is composed of minimal semantic atomic steps. Different from existing methods that rely on structured templates or free-form paradigms, our method can not only generate cognitive CoT structures for various complex tasks but also mitigate the phenomenon of overthinking. To introduce structured reasoning capabilities into visual understanding models, we further design a novel AtomThink framework with four key modules: (i) a data engine to generate high-quality multimodal reasoning paths; (ii) a supervised fine-tuning process with serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric to evaluate the single-step utilization rate. We conduct extensive experiments to show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving more than 10% average accuracy gains on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5 times and boosts inference efficiency by 85.3%. Our code is publicly available at https://github.com/Quinn777/AtomThink.
Submitted 8 March, 2025;
originally announced March 2025.
-
CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data
Authors:
Disheng Liu,
Yiran Qiao,
Wuche Liu,
Yiren Lu,
Yunlai Zhou,
Tuo Liang,
Yu Yin,
Jing Ma
Abstract:
True intelligence hinges on the ability to uncover and leverage hidden causal relations. Despite significant progress in AI and computer vision (CV), there remains a lack of benchmarks for assessing models' abilities to infer latent causality from complex visual data. In this paper, we introduce Causal3D, a novel and comprehensive benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Designed within a systematic framework, Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds, enabling evaluations across scenes of varying complexity. We assess multiple state-of-the-art methods, including classical causal discovery, causal representation learning, and large/vision-language models (LLMs/VLMs). Our experiments show that as causal structures grow more complex without prior knowledge, performance declines significantly, highlighting the challenges even advanced methods face in complex causal scenarios. Causal3D serves as a vital resource for advancing causal reasoning in CV and fostering trustworthy AI in critical domains.
Submitted 5 March, 2025;
originally announced March 2025.
-
The Aesthetic Imperative of Lev Landau's Geometric Reductionism in Theoretical Physics
Authors:
Jingxu Wu,
Yuwei Yin
Abstract:
This paper explores the ontological and epistemological foundations of Lev Landau's theoretical physics through the lens of his unpublished philosophical notes and scientific practice. We identify a unique form of geometric reductionism where physical laws emerge as inevitable consequences of symmetry breaking in progressively constrained phase spaces. Landau's dismissal of quantum interpretation debates and his famous "axiomatic minimalism" in the Course of Theoretical Physics are shown to stem from a deep epistemological commitment to dimensional aesthetics - the belief that fundamental truths must manifest through dimensional economy in mathematical representations.
Submitted 21 February, 2025;
originally announced March 2025.
-
Lost in Literalism: How Supervised Training Shapes Translationese in LLMs
Authors:
Yafu Li,
Ronghao Zhang,
Zhilin Wang,
Huajian Zhang,
Leyang Cui,
Yongjing Yin,
Tong Xiao,
Yue Zhang
Abstract:
Large language models (LLMs) have achieved remarkable success in machine translation, demonstrating impressive performance across diverse languages. However, translationese, characterized by overly literal and unnatural translations, remains a persistent challenge in LLM-based translation systems. Despite their pre-training on vast corpora of natural utterances, LLMs exhibit translationese errors and generate unexpected unnatural translations, stemming from biases introduced during supervised fine-tuning (SFT). In this work, we systematically evaluate the prevalence of translationese in LLM-generated translations and investigate its roots during supervised training. We introduce methods to mitigate these biases, including polishing golden references and filtering unnatural training instances. Empirical evaluations demonstrate that these approaches significantly reduce translationese while improving translation naturalness, validated by human evaluations and automatic metrics. Our findings highlight the need for training-aware adjustments to optimize LLM translation outputs, paving the way for more fluent and target-language-consistent translations. We release the data and code at https://github.com/yafuly/LLM_Translationese.
Submitted 6 March, 2025;
originally announced March 2025.
-
EP240801a/XRF 240801B: An X-ray Flash Detected by the Einstein Probe and Implications of its Multiband Afterglow
Authors:
Shuai-Qing Jiang,
Dong Xu,
Agnes P. C. van Hoof,
Wei-Hua Lei,
Yuan Liu,
Hao Zhou,
Yong Chen,
Shao-Yu Fu,
Jun Yang,
Xing Liu,
Zi-Pei Zhu,
Alexei V. Filippenko,
Peter G. Jonker,
A. S. Pozanenko,
He Gao,
Xue-Feng Wu,
Bing Zhang,
Gavin P Lamb,
Massimiliano De Pasquale,
Shiho Kobayashi,
Franz Erik Bauer,
Hui Sun,
Giovanna Pugliese,
Jie An,
Valerio D'Elia
, et al. (67 additional authors not shown)
Abstract:
We present multiband observations and analysis of EP240801a, a low-energy, extremely soft gamma-ray burst (GRB) discovered on August 1, 2024 by the Einstein Probe (EP) satellite, with a weak contemporaneous signal also detected by Fermi/GBM. Optical spectroscopy of the afterglow, obtained by GTC and Keck, yielded a redshift of $z = 1.6734$. EP240801a exhibits a burst duration of 148 s in X-rays and 22.3 s in gamma-rays, with X-rays leading by 80.61 s. Spectral lag analysis indicates the gamma-ray signal arrived 8.3 s earlier than the X-rays. Joint spectral fitting of EP/WXT and Fermi/GBM data yields an isotropic energy $E_{\gamma,\rm{iso}} = (5.57^{+0.54}_{-0.50})\times 10^{51}\,\rm{erg}$, a peak energy $E_{\rm{peak}} = 14.90^{+7.08}_{-4.71}\,\rm{keV}$, and a fluence ratio $\rm S(25-50\,\rm{keV})/S(50-100\,\rm{keV}) = 1.67^{+0.74}_{-0.46}$, classifying EP240801a as an X-ray flash (XRF). The host-galaxy continuum spectrum, inferred using Prospector, was used to correct for the host contribution to the observed optical data of the outburst. Unusual early $R$-band behavior and EP/FXT observations suggest multiple components in the afterglow. Three models are considered: a two-component jet model, a forward-reverse shock model, and a forward-shock model with energy injection. All three provide reasonable explanations. The two-component jet model and the energy injection model imply a relatively small initial energy and velocity of the jet in the line of sight, while the forward-reverse shock model remains typical. Under the two-component jet model, EP240801a may resemble GRB 221009A (BOAT) if the bright narrow beam is viewed on-axis. Therefore, EP240801a can be interpreted as an off-beam (narrow) jet or an intrinsically weak GRB jet. Our findings provide crucial clues for uncovering the origin of XRFs.
Submitted 6 March, 2025;
originally announced March 2025.
-
TAIL: Text-Audio Incremental Learning
Authors:
Yingfei Sun,
Xu Gu,
Wei Ji,
Hanbin Zhao,
Hao Fei,
Yifang Yin,
Roger Zimmermann
Abstract:
Many studies combine text and audio to capture multi-modal information, but they overlook the model's generalization ability on new datasets. Introducing new datasets may affect the feature space of the original dataset, leading to catastrophic forgetting. Meanwhile, large model parameters can significantly impact training performance. To address these limitations, we introduce a novel task called Text-Audio Incremental Learning (TAIL) for text-audio retrieval, and propose a new method, PTAT, Prompt Tuning for Audio-Text incremental learning. This method utilizes prompt tuning to optimize the model parameters while incorporating an audio-text similarity and feature distillation module to effectively mitigate catastrophic forgetting. We benchmark our method and previous incremental learning methods on the AudioCaps, Clotho, BBC Sound Effects and Audioset datasets, and our method outperforms previous methods significantly, particularly demonstrating stronger resistance to forgetting on older datasets. Compared to the full-parameters Finetune (Sequential) method, our model requires only 2.42% of its parameters while achieving 4.46% higher performance.
Submitted 6 March, 2025;
originally announced March 2025.
-
Efficient neural topology optimization via active learning for enhancing turbulent mass transfer in fluid channels
Authors:
Chenhui Kou,
Yuhui Yin,
Min Zhu,
Shengkun Jia,
Yiqing Luo,
Xigang Yuana,
Lu Lu
Abstract:
The design of fluid channel structures of reactors or separators of chemical processes is key to enhancing the mass transfer processes inside the devices. However, the systematic design of channel topological structures is difficult for complex turbulent flows. Here, we address this challenge by developing a machine learning framework to efficiently perform topology optimization of channel structures for turbulent mass transfer. We represent a topological structure using a neural network (referred to as "neural topology"), which is optimized by employing pre-trained neural operators combined with a fine-tuning strategy with active data augmentation. The optimization is performed with two objectives, maximization of mass transfer efficiency and minimization of energy consumption, to allow for a compromise between the two in real-world designs. The developed neural operator with active learning is data efficient in network training and demonstrates superior computational efficiency compared with traditional methods in obtaining optimal structures across a large design space. The optimization results are validated through experiments, which show that the optimized channel improves concentration uniformity by 37% compared with the original channel. We also demonstrate the variation of the optimal structures with changes in inlet velocity conditions, providing a reference for designing turbulent mass-transfer devices under different operating conditions.
Submitted 5 March, 2025;
originally announced March 2025.
-
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Authors:
Zhangchen Xu,
Yang Liu,
Yueqin Yin,
Mingyuan Zhou,
Radha Poovendran
Abstract:
We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either the breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases with additional attempts allocated to challenging problems. Finally, post-training data synthesis is done by rewriting questions into diverse formats and generating responses under a test-based rejection sampling procedure from a reasoning model (DeepSeek R1). This pipeline yields a large-scale, robust and diverse coding dataset. KodCode is suitable for supervised fine-tuning, and the paired unit tests also provide great potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
Submitted 4 March, 2025;
originally announced March 2025.
-
Med-LEGO: Editing and Adapting toward Generalist Medical Image Diagnosis
Authors:
Yitao Zhu,
Yuan Yin,
Jiaming Li,
Mengjie Xu,
Zihao Zhao,
Honglin Xiong,
Sheng Wang,
Qian Wang
Abstract:
The adoption of visual foundation models has become a common practice in computer-aided diagnosis (CAD). While these foundation models provide a viable solution for creating generalist medical AI, privacy concerns make it difficult to pre-train or continuously update such models across multiple domains and datasets, leading many studies to focus on specialist models. To address this challenge, we propose Med-LEGO, a training-free framework that enables the seamless integration or updating of a generalist CAD model by combining multiple specialist models, similar to assembling LEGO bricks. Med-LEGO enhances LoRA (low-rank adaptation) by incorporating singular value decomposition (SVD) to efficiently capture the domain expertise of each specialist model with minimal additional parameters. By combining these adapted weights through simple operations, Med-LEGO allows for the easy integration or modification of specific diagnostic capabilities without the need for original data or retraining. Finally, the combined model can be further adapted to new diagnostic tasks, making it a versatile generalist model. Our extensive experiments demonstrate that Med-LEGO outperforms existing methods in both cross-domain and in-domain medical tasks while using only 0.18% of full model parameters. These merged models show better convergence and generalization to new tasks, providing an effective path toward generalist medical AI.
Submitted 2 March, 2025;
originally announced March 2025.
-
DPR: Diffusion Preference-based Reward for Offline Reinforcement Learning
Authors:
Teng Pang,
Bingzheng Wang,
Guoqiang Wu,
Yilong Yin
Abstract:
Offline preference-based reinforcement learning (PbRL) mitigates the need for reward definition, aligning with human preferences via preference-driven reward feedback without interacting with the environment. However, the effectiveness of preference-driven reward functions depends on the modeling ability of the learning model, which current MLP-based and Transformer-based methods may fail to adequately provide. To alleviate the failure of the reward function caused by insufficient modeling, we propose a novel preference-based reward acquisition method: Diffusion Preference-based Reward (DPR). Unlike previous methods using Bradley-Terry models for trajectory preferences, we use diffusion models to directly model preference distributions for state-action pairs, allowing rewards to be discriminatively obtained from these distributions. In addition, since preference data capture only the internal relationships of paired trajectories, we further propose Conditional Diffusion Preference-based Reward (C-DPR), which leverages relative preference information to enhance the construction of the diffusion model. We apply the above methods to existing offline reinforcement learning algorithms, and a series of experimental results demonstrates that the diffusion-based reward acquisition approach outperforms previous MLP-based and Transformer-based methods.
Submitted 13 May, 2025; v1 submitted 2 March, 2025;
originally announced March 2025.
-
Re-Evaluating the Impact of Unseen-Class Unlabeled Data on Semi-Supervised Learning Model
Authors:
Rundong He,
Yicong Dong,
Lanzhe Guo,
Yilong Yin,
Tailin Wu
Abstract:
Semi-supervised learning (SSL) effectively leverages unlabeled data and has been proven successful across various fields. Current safe SSL methods assume that unseen classes in unlabeled data harm the performance of SSL models. However, previous methods for assessing the impact of unseen classes on SSL model performance are flawed. They fix the size of the unlabeled dataset and adjust the proportion of unseen classes within the unlabeled data to assess the impact. This process contravenes the principle of controlling variables. Adjusting the proportion of unseen classes in unlabeled data alters the proportion of seen classes, meaning the decreased classification performance of seen classes may not be due to an increase in unseen class samples in the unlabeled data, but rather a decrease in seen class samples. Thus, the prior flawed assessment standard that "unseen classes in unlabeled data can damage SSL model performance" may not always hold true. This paper strictly adheres to the principle of controlling variables, maintaining the proportion of seen classes in unlabeled data while only changing the unseen classes across five critical dimensions, to investigate their impact on SSL models in terms of global robustness and local robustness. Experiments demonstrate that unseen classes in unlabeled data do not necessarily impair the performance of SSL models; in fact, under certain conditions, unseen classes may even enhance it.
Submitted 2 March, 2025;
originally announced March 2025.
-
G-OSR: A Comprehensive Benchmark for Graph Open-Set Recognition
Authors:
Yicong Dong,
Rundong He,
Guangyao Chen,
Wentao Zhang,
Zhongyi Han,
Jieming Shi,
Yilong Yin
Abstract:
Graph Neural Networks (GNNs) have achieved significant success in machine learning, with wide applications in social networks, bioinformatics, knowledge graphs, and other fields. Most research assumes ideal closed-set environments. However, in real-world open-set environments, graph learning models face challenges in robustness and reliability due to unseen classes. This highlights the need for Graph Open-Set Recognition (GOSR) methods to address these issues and ensure effective GNN application in practical scenarios. Research in GOSR is in its early stages, with a lack of a comprehensive benchmark spanning diverse tasks and datasets to evaluate methods. Moreover, traditional methods, Graph Out-of-Distribution Detection (GOODD), GOSR, and Graph Anomaly Detection (GAD) have mostly evolved in isolation, with little exploration of their interconnections or potential applications to GOSR. To fill these gaps, we introduce G-OSR, a comprehensive benchmark for evaluating GOSR methods at both the node and graph levels, using datasets from multiple domains to ensure fair and standardized comparisons of effectiveness and efficiency across traditional, GOODD, GOSR, and GAD methods. The results offer critical insights into the generalizability and limitations of current GOSR methods and provide valuable resources for advancing research in this field through systematic analysis of diverse approaches.
Submitted 1 March, 2025;
originally announced March 2025.
-
The $s\pm$ pairing symmetry in the pressured La$_3$Ni$_2$O$_7$ from electron-phonon coupling
Authors:
Yucong Yin,
Jun Zhan,
Boyang Liu,
Xinloong Han
Abstract:
The recently discovered bilayer Ruddlesden-Popper nickelate La$_3$Ni$_2$O$_7$ exhibits superconductivity with a remarkable transition temperature $T_c\approx 80$ K under applied pressures above 14.0 GPa. The discovery of this new family of high-temperature superconductors has garnered significant attention in the condensed matter physics community. In this work, we assume that this high-temperature superconductivity is mediated by phonons and investigate the pairing symmetry in two distinct models: (i) the full-coupling case, where the Ni-$d_{x^2-y^2}$ and Ni-$d_{3z^2-r^2}$ orbitals are treated equally in both interlayer and intralayer coupling interactions, and (ii) the half-coupling case, where the intralayer coupling involves only the $d_{x^2-y^2}$ orbital, while the interlayer coupling is restricted to the $d_{3z^2-r^2}$ orbital. Our calculations reveal that the interlayer coupling favors an $s\pm$-wave superconducting state, whereas the intralayer coupling promotes an $s++$-wave symmetry. Additionally, we discuss the implications of pair-hopping interactions on the superconducting properties. These findings provide valuable insights into the pairing mechanisms and symmetry of this newly discovered high-temperature superconductor.
Submitted 28 February, 2025;
originally announced February 2025.
-
OntologyRAG: Better and Faster Biomedical Code Mapping with Retrieval-Augmented Generation (RAG) Leveraging Ontology Knowledge Graphs and Large Language Models
Authors:
Hui Feng,
Yuntzu Yin,
Emiliano Reynares,
Jay Nanavati
Abstract:
Biomedical ontologies, which comprehensively define concepts and relations for biomedical entities, are crucial for structuring and formalizing domain-specific information representations. Biomedical code mapping identifies similarity or equivalence between concepts from different ontologies. Obtaining high-quality mappings usually relies on automatic generation of unrefined mappings with language models (LMs) fine-tuned on the ontology domain, followed by manual selections or corrections by coding experts who have extensive domain expertise and familiarity with ontology schemas. The LMs usually provide unrefined code mapping suggestions as a list of candidates without reasoning or supporting evidence, hence coding experts still need to verify each suggested candidate against ontology sources to pick the best matches. This is also a recurring task, as ontology sources are updated regularly to incorporate new research findings. Consequently, the need for regular LM retraining and manual refinement makes code mapping time-consuming and labour-intensive. In this work, we created OntologyRAG, an ontology-enhanced retrieval-augmented generation (RAG) method that leverages the inductive biases from ontological knowledge graphs for in-context learning (ICL) in large language models (LLMs). Our solution grounds LLMs in knowledge graphs with unrefined mappings between ontologies and processes questions by generating an interpretable set of results that include a prediction rationale with a mapping proximity assessment. Our solution does not require retraining LMs, as all ontology updates can be reflected by updating the knowledge graphs with a standard process. Evaluation results on a self-curated gold dataset show the promise of using our method to enable coding experts to achieve better and faster code mapping. The code is available at https://github.com/iqvianlp/ontologyRAG.
Submitted 26 February, 2025;
originally announced February 2025.
-
Ion counting and temperature determination of Coulomb-crystallized laser-cooled ions in traps using convolutional neural networks
Authors:
Yanning Yin,
Stefan Willitsch
Abstract:
Coulomb crystals, ordered structures of cold ions confined in ion traps, find applications in a variety of research fields. The number and temperature of the ions forming the Coulomb crystals are two key attributes of interest in many trapped-ion experiments. Here, we present a fast and accurate approach of determining these attributes from fluorescence images of the ions based on convolutional neural networks (CNNs). In this approach, we first generate a large number of images of Coulomb crystals with different ion numbers and temperatures using molecular-dynamics simulations and then train CNN models on these images to classify the desired attributes. The classification performance of several common pre-trained CNN models was compared in example tasks. We find that for crystals with ion numbers in the range 100-299 and secular temperatures of 5-15 mK, the best-performing model can discern number variations on the level of one ion with an accuracy of 93% and temperature variations by 1 mK with an accuracy of 92%. Since the trained model can be directly integrated into experiments, in-situ determination of these attributes can be realized in a non-invasive fashion, which has the potential to greatly facilitate the analysis and control of trapped ions in real time.
Submitted 25 February, 2025;
originally announced February 2025.
-
HVIS: A Human-like Vision and Inference System for Human Motion Prediction
Authors:
Kedi Lyu,
Haipeng Chen,
Zhenguang Liu,
Yifang Yin,
Yukang Lin,
Yingying Jiao
Abstract:
Grasping the intricacies of human motion, which involve perceiving spatio-temporal dependence and multi-scale effects, is essential for predicting human motion. While humans inherently possess the requisite skills to navigate this issue, it proves to be markedly more challenging for machines to emulate. To bridge the gap, we propose the Human-like Vision and Inference System (HVIS) for human motion prediction, which is designed to emulate human observation and forecast future movements. HVIS comprises two components: the human-like vision encode (HVE) module and the human-like motion inference (HMI) module. The HVE module mimics and refines the human visual process, incorporating a retina-analog component that captures spatiotemporal information separately to avoid unnecessary crosstalk. Additionally, a visual cortex-analogy component is designed to hierarchically extract and treat complex motion features, focusing on both global and local features of human poses. The HMI is employed to simulate the multi-stage learning model of the human brain. The spontaneous learning network simulates the neuronal fracture generation process for the adversarial generation of future motions. Subsequently, the deliberate learning network is optimized for hard-to-train joints to prevent misleading learning. Experimental results demonstrate that our method achieves new state-of-the-art performance, significantly outperforming existing methods by 19.8% on Human3.6M, 15.7% on CMU Mocap, and 11.1% on G3D.
Submitted 24 February, 2025;
originally announced February 2025.
-
ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors
Authors:
Yuguo Yin,
Yuxin Xie,
Wenyuan Yang,
Dongchao Yang,
Jinghan Ru,
Xianwei Zhuang,
Liming Liang,
Yuexian Zou
Abstract:
Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to retrieve audio clips or multilingual texts from databases. However, existing ML-ATR schemes suffer from inconsistencies for instance similarity matching across languages. We theoretically analyze the inconsistency in terms of both multilingual modal alignment direction error and weight error, and propose the theoretical weight error upper bound for quantifying the inconsistency. Based on the analysis of the weight error upper bound, we find that the inconsistency problem stems from the data distribution error caused by random sampling of languages. We propose a consistent ML-ATR scheme using 1-to-k contrastive learning and audio-English co-anchor contrastive learning, aiming to mitigate the negative impact of data distribution error on recall and consistency in ML-ATR. Experimental results on the translated AudioCaps and Clotho datasets show that our scheme achieves state-of-the-art performance on recall and consistency metrics for eight mainstream languages, including English. Our code will be available at https://github.com/ATRI-ACL/ATRI-ACL.
Submitted 4 June, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers
Authors:
Ke Cao,
Jing Wang,
Ao Ma,
Jiasong Feng,
Zhanjie Zhang,
Xuanhua He,
Shanyuan Liu,
Bo Cheng,
Dawei Leng,
Yuhui Yin,
Jie Zhang
Abstract:
The Diffusion Transformer plays a pivotal role in advancing text-to-image and text-to-video generation, owing primarily to its inherent scalability. However, existing controlled diffusion transformer methods incur significant parameter and computational overheads and suffer from inefficient resource allocation due to their failure to account for the varying relevance of control information across different transformer layers. To address this, we propose the Relevance-Guided Efficient Controllable Generation framework, RelaCtrl, enabling efficient and resource-optimized integration of control signals into the Diffusion Transformer. First, we evaluate the relevance of each layer in the Diffusion Transformer to the control information by assessing the "ControlNet Relevance Score"-i.e., the impact of skipping each control layer on both the quality of generation and the control effectiveness during inference. Based on the strength of the relevance, we then tailor the positioning, parameter scale, and modeling capacity of the control layers to reduce unnecessary parameters and redundant computations. Additionally, to further improve efficiency, we replace the self-attention and FFN in the commonly used copy block with the carefully designed Two-Dimensional Shuffle Mixer (TDSM), enabling efficient implementation of both the token mixer and channel mixer. Both qualitative and quantitative experimental results demonstrate that our approach achieves superior performance with only 15% of the parameters and computational complexity compared to PixArt-delta.
Submitted 23 March, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
Asynchronous Stochastic Block Projection Algorithm for Solving Linear Systems under Predefined Communication Patterns
Authors:
Yanchen Yin,
Yongli Wang
Abstract:
This paper proposes an event-triggered asynchronous distributed randomized block Kaczmarz projection (ER-AD-RBKP) algorithm for efficiently solving large-scale linear systems in resource-constrained and communication-unstable environments. The algorithm enables each agent to update its local state estimate independently and engage in communication only when specific triggering conditions are satisfied, thereby significantly reducing communication overhead. At each iteration, agents perform projections using randomly selected partial local data blocks to lower per-iteration computational costs and enhance scalability. By defining events that ensure strong connectivity in the communication graph, we derive the sufficient conditions for global convergence under a probabilistic framework, proving that the algorithm converges exponentially in expectation as long as no extreme events (e.g., permanent agent disconnection) occur. Besides, for inconsistent systems, auxiliary variables are incorporated to transform the problem into an equivalent consistent formulation, and theoretical error bounds are derived. Moreover, we implement the ER-AD-RBKP algorithm in an asynchronous communication environment built on ROS2, a distributed middleware framework for real-time robotic systems. We evaluate the algorithm under various settings, including varying numbers of agents, neighborhood sizes, communication intervals, and failure scenarios such as communication disruptions and processing faults. Experimental results demonstrate the robust performance of the proposed algorithm in terms of computational efficiency, communication cost, and system resilience, highlighting its strong potential for practical applicability in real-world distributed systems.
Submitted 15 June, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
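The projection step named in the ER-AD-RBKP abstract above can be illustrated with a much simpler, centralized sketch. This is not the authors' algorithm: it has no agents, event triggers, or asynchronous communication, and it uses an averaged-row-projection form of the randomized block Kaczmarz update, which is an assumption made here to keep the example dependency-free. The function name `averaged_block_kaczmarz` and the small test system are hypothetical.

```python
import random


def averaged_block_kaczmarz(A, b, block_size=2, iters=2000, seed=0):
    """Centralized randomized block Kaczmarz with averaged row projections.

    Illustrative assumption: each iteration samples a random block of rows
    and averages the single-row Kaczmarz projections over that block.
    """
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    x = [0.0] * n
    # Squared Euclidean norm of every row, computed once.
    row_norm_sq = [sum(v * v for v in row) for row in A]
    for _ in range(iters):
        # Randomly selected partial data block (row indices).
        block = rng.sample(range(m), block_size)
        step = [0.0] * n
        for i in block:
            # Residual of row i at the current estimate.
            residual = b[i] - sum(a * xj for a, xj in zip(A[i], x))
            scale = residual / row_norm_sq[i]
            for j in range(n):
                step[j] += scale * A[i][j]
        # Average the per-row projection steps over the block.
        x = [xj + sj / block_size for xj, sj in zip(x, step)]
    return x


# Small consistent system A x = b with a known solution (hypothetical data).
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0], [3.0, 2.0]]
x_true = [1.0, -2.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
x_hat = averaged_block_kaczmarz(A, b)
```

Because the system is consistent, every row projection moves the estimate no farther from the true solution, so `x_hat` should approach `x_true`. Inconsistent systems would need the auxiliary-variable reformulation mentioned in the abstract, which this sketch does not attempt.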
-
Idiosyncrasies in Large Language Models
Authors:
Mingjie Sun,
Yida Yin,
Zhiqiu Xu,
J. Zico Kolter,
Zhuang Liu
Abstract:
In this work, we unveil and study idiosyncrasies in Large Language Models (LLMs) -- unique patterns in their outputs that can be used to distinguish the models. To do so, we consider a simple classification task: given a particular text output, the objective is to predict the source LLM that generates the text. We evaluate this synthetic task across various groups of LLMs and find that simply fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on held-out validation data in the five-way classification problem involving ChatGPT, Claude, Grok, Gemini, and DeepSeek. Our further investigation reveals that these idiosyncrasies are rooted in word-level distributions. These patterns persist even when the texts are rewritten, translated, or summarized by an external LLM, suggesting that they are also encoded in the semantic content. Additionally, we leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies. Finally, we discuss the broader implications of our findings, including training on synthetic data, inferring model similarity, and robust evaluation of LLMs. Code is available at https://github.com/locuslab/llm-idiosyncrasies.
Submitted 16 June, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving
Authors:
Xin Xu,
Yan Xu,
Tianhao Chen,
Yuchen Yan,
Chengwu Liu,
Zaoyu Chen,
Yufei Wang,
Yichun Yin,
Yasheng Wang,
Lifeng Shang,
Qun Liu
Abstract:
Existing approaches to mathematical reasoning with large language models (LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation. While efforts have been made to combine these methods, they primarily rely on post-selection or predefined strategies, leaving an open question: whether LLMs can autonomously adapt their reasoning strategy based on their inherent capabilities. In this work, we propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework that enables LLMs to personalize their reasoning strategy spontaneously, aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware data selection during supervised fine-tuning (SFT) to tailor training data to the model's unique abilities. This approach equips LLMs to autonomously determine and apply the appropriate reasoning strategy at test time. We evaluate TATA through extensive experiments on six mathematical reasoning benchmarks, using both general-purpose and math-specialized LLMs. Empirical results demonstrate that TATA effectively combines the complementary strengths of CoT and TIR, achieving superior or comparable performance with improved inference efficiency compared to TIR alone. Further analysis underscores the critical role of aptitude-aware data selection in enabling LLMs to make effective and adaptive reasoning decisions and align reasoning strategies with model capabilities.
Submitted 25 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Authors:
Ailin Huang,
Boyong Wu,
Bruce Wang,
Chao Yan,
Chen Hu,
Chengli Feng,
Fei Tian,
Feiyu Shen,
Jingbei Li,
Mingrui Chen,
Peng Liu,
Ruihang Miao,
Wang You,
Xi Chen,
Xuerui Yang,
Yechang Huang,
Yuxiang Zhang,
Zheng Gong,
Zixin Zhang,
Hongyu Zhou,
Jianjian Sun,
Brian Li,
Chengting Feng,
Changyi Wan,
Hanpeng Hu
, et al. (120 additional authors not shown)
Abstract:
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Submitted 18 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Local Gibbs sampling beyond local uniformity
Authors:
Hongyang Liu,
Chunyang Wang,
Yitong Yin
Abstract:
Local samplers are algorithms that generate random samples based on local queries to high-dimensional distributions, ensuring the samples follow the correct induced distributions while maintaining time complexity that scales locally with the query size. These samplers have broad applications, including deterministic approximate counting [He, Wang, Yin, SODA '23; Feng et al., FOCS '23], sampling from infinite or high-dimensional Gibbs distributions [Anand, Jerrum, SICOMP '22; He, Wang, Yin, FOCS '22], and providing local access to large random objects [Biswas, Rubinfeld, Yodpinyanee, ITCS '20].
In this work, we present a local sampler for Gibbs distributions of spin systems whose efficiency does not rely on the "local uniformity" property, which imposes unconditional marginal lower bounds -- a key assumption required by all prior local samplers. For fundamental models such as the Ising model, our algorithm achieves local efficiency in near-critical regimes, providing an exponential improvement over existing methods. Additionally, our approach is applicable to spin systems on graphs with unbounded degrees and supports dynamic sampling within the same near-critical regime.
Submitted 15 February, 2025;
originally announced February 2025.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Authors:
Guoqing Ma,
Haoyang Huang,
Kun Yan,
Liangyu Chen,
Nan Duan,
Shengming Yin,
Changyi Wan,
Ranchen Ming,
Xiaoniu Song,
Xing Chen,
Yu Zhou,
Deshan Sun,
Deyu Zhou,
Jian Zhou,
Kaijun Tan,
Kang An,
Mei Chen,
Wei Ji,
Qiling Wu,
Wen Sun,
Xin Han,
Yanan Wei,
Zheng Ge,
Aojie Li,
Bin Wang
, et al. (90 additional authors not shown)
Abstract:
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
Submitted 24 February, 2025; v1 submitted 14 February, 2025;
originally announced February 2025.
-
Do we really have to filter out random noise in pre-training data for language models?
Authors:
Jinghan Ru,
Yuxin Xie,
Xianwei Zhuang,
Yuguo Yin,
Zhihui Guo,
Zhiming Liu,
Qianli Ren,
Yuexian Zou
Abstract:
Web-scale pre-training datasets are the cornerstone of LLMs' success. However, text data curated from the Internet inevitably contains random noise caused by decoding errors or unregulated web content. In contrast to previous works that focus on low quality or synthetic data, our study provides the first systematic investigation of such random noise through a cohesive "What-Why-How" framework. Surprisingly, we observed that the resulting increase in the loss of next-token prediction (NTP) was significantly lower than the proportion of random noise even when the model was scaled up to 2.7B. We provide a theoretical justification for this phenomenon, which also elucidates the success of multilingual models and can be applied to multimodal models. On the other hand, experiments show that the model's performance in downstream tasks is not based solely on the NTP loss, which means that random noise may result in degraded downstream performance. To address the potential adverse effects, we introduce a novel plug-and-play Local Gradient Matching loss, which explicitly enhances the denoising capability of the downstream task head by aligning the gradient of normal and perturbed features without requiring knowledge of the model's parameters. Additional experiments on 8 language and 14 vision benchmarks further validate its effectiveness.
Submitted 15 May, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
Regression and Forecasting of U.S. Stock Returns Based on LSTM
Authors:
Shicheng Zhou,
Zizhou Zhang,
Rong Zhang,
Yuchen Yin,
Chia Hong Chang,
Qinyan Shen
Abstract:
This paper analyses the investment returns of three stock sectors, Manuf, Hitec, and Other, in the U.S. stock market, based on the Fama-French three-factor model, the Carhart four-factor model, and the Fama-French five-factor model, in order to test the validity of these three models for the three sectors of the market. Also, the LSTM model is used to explore the additional factors affecting stock returns. The empirical results show that the Fama-French five-factor model has better validity for the three segments of the market under study, and that the LSTM model is able to capture the factors affecting the returns of certain industries and can better regress and predict the stock returns of the relevant industries. Keywords: Fama-French model; Carhart model; Factor model; LSTM model.
Submitted 28 May, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning
Authors:
Yuwei Yin,
Giuseppe Carenini
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities on complex evaluation benchmarks, many of which are formulated as question-answering (QA) tasks. Enhancing the performance of LLMs in QA contexts is becoming increasingly vital for advancing their development and applicability. This paper introduces ARR, an intuitive, effective, and general QA solving method that explicitly incorporates three key steps: analyzing the intent of the question, retrieving relevant information, and reasoning step by step. Notably, this paper is the first to introduce intent analysis in QA, which plays a vital role in ARR. Comprehensive evaluations across 10 diverse QA tasks demonstrate that ARR consistently outperforms the baseline methods. Ablation and case studies further validate the positive contributions of each ARR component. Furthermore, experiments involving variations in prompt design indicate that ARR maintains its effectiveness regardless of the specific prompt formulation. Additionally, extensive evaluations across various model sizes, LLM series, and generation settings solidify the effectiveness, robustness, and generalizability of ARR.
Submitted 15 May, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
How Generative AI supports human in conceptual design
Authors:
Liuging Chen,
Yaxuan Song,
Jia Guo,
Lingyun Sun,
Peter Childs,
Yuan Yin
Abstract:
Generative Artificial Intelligence (Generative AI) is a collection of AI technologies that can generate new information such as texts and images. With its strong capabilities, Generative AI has been actively studied in creative design processes. However, limited studies have explored the roles of humans and Generative AI in conceptual design processes, leaving a gap for human-AI collaboration investigation. To address this gap, this study uncovers the contributions of different Generative AI technologies in assisting humans in the conceptual design process. Novice designers completed two design tasks with or without the assistance of Generative AI. Results revealed that Generative AI primarily assists humans in problem definition and idea generation stages, while idea selection and evaluation remain predominantly human-led. Additionally, with Generative AI assistance, the idea selection and evaluation stages were further enhanced. Based on the findings, we discuss the role of Generative AI in human-AI collaboration and implications for enhancing future conceptual design support with Generative AI assistance.
Submitted 31 January, 2025;
originally announced February 2025.
-
Solid-state Synapse Based on Magnetoelectrically Coupled Memristor
Authors:
Weichuan Huang,
Yue-Wen Fang,
Yuewei Yin,
Bobo Tian,
Wenbo Zhao,
Chuangming Hou,
Chao Ma,
Qi Li,
Evgeny Y. Tsymbal,
Chun-Gang Duan,
Xiaoguang Li
Abstract:
Brain-inspired computing architectures attempt to emulate the computations performed in the neurons and the synapses in the human brain. Memristors with continuously tunable resistances are ideal building blocks for artificial synapses. Through investigating the memristor behaviors in a La0.7Sr0.3MnO3/BaTiO3/La0.7Sr0.3MnO3 multiferroic tunnel junction, it was found that the ferroelectric domain dynamics characteristics are influenced by the relative magnetization alignment of the electrodes, and the interfacial spin polarization is manipulated continuously by ferroelectric domain reversal, enriching our fundamental understanding of the magnetoelectric coupling. This creates a functionality in which not only the resistance of the memristor but also the form of synaptic plasticity can be further manipulated, as demonstrated by the spike-timing-dependent plasticity investigations. Density functional theory calculations are carried out to describe the obtained magnetoelectric coupling, which is probably related to the Mn-Ti intermixing at the interfaces. The multiple and controllable plasticity characteristics in a single artificial synapse, which resemble the synaptic morphological alteration property in a biological synapse, will be conducive to the development of artificial intelligence.
Submitted 31 January, 2025;
originally announced January 2025.
-
BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning
Authors:
Han Zhong,
Yutong Yin,
Shenao Zhang,
Xiaojun Xu,
Yuanxin Liu,
Yifei Zuo,
Zhihan Liu,
Boyi Liu,
Sirui Zheng,
Hongyi Guo,
Liwei Wang,
Mingyi Hong,
Zhaoran Wang
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, yet generating reliable reasoning processes remains a significant challenge. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model incorporating latent thinking processes and evaluation signals. Within this framework, we introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps. First, it generates high-quality rationales by approximating the optimal thinking process through reinforcement learning, using a novel reward shaping mechanism. Second, it enhances the base LLM by maximizing the joint probability of rationale generation with respect to the model's parameters. Theoretically, we demonstrate BRiTE's convergence at a rate of $1/T$ with $T$ representing the number of iterations. Empirical evaluations on math and coding benchmarks demonstrate that our approach consistently improves performance across different base models without requiring human-annotated thinking processes. In addition, BRiTE demonstrates superior performance compared to existing algorithms that bootstrap thinking processes using alternative methods such as rejection sampling, and can even match or exceed the results achieved through supervised fine-tuning with human-annotated data.
Submitted 6 June, 2025; v1 submitted 30 January, 2025;
originally announced January 2025.
-
Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data?
Authors:
Yutong Yin,
Zhaoran Wang
Abstract:
Humans exhibit remarkable compositional reasoning by integrating knowledge from various sources. For example, if someone learns $B = f(A)$ from one source and $C = g(B)$ from another, they can deduce $C = g(B) = g(f(A))$ even without encountering $ABC$ together, showcasing the generalization ability of human intelligence. In this paper, we introduce a synthetic learning task, "FTCT" (Fragmented at Training, Chained at Testing), to validate the potential of Transformers in replicating this skill and interpret its inner mechanism. In the training phase, data consist of separated knowledge fragments from an overall causal graph. During testing, Transformers must infer complete causal graph traces by integrating these fragments. Our findings demonstrate that few-shot Chain-of-Thought prompting enables Transformers to perform compositional reasoning on FTCT by revealing correct combinations of fragments, even if such combinations were absent in the training data. Furthermore, the emergence of compositional reasoning ability is strongly correlated with the model complexity and training-testing data similarity. We propose, both theoretically and empirically, that Transformers learn an underlying generalizable program from training, enabling effective compositional reasoning during testing.
Submitted 2 June, 2025; v1 submitted 27 January, 2025;
originally announced January 2025.