Search | arXiv e-print repository

HRGS: Hierarchical Gaussian Splatting for Memory-Efficient High-Resolution 3D Reconstruction

Authors: Changbai Li, Haodong Zhu, Hanlin Chen, Juan Zhang, Tongfei Chen, Shuo Yang, Shuwei Shao, Wenhao Dong, Baochang Zhang

Abstract: 3D Gaussian Splatting (3DGS) has made significant strides in real-time 3D scene reconstruction, but faces memory scalability issues in high-resolution scenarios. To address this, we propose Hierarchical Gaussian Splatting (HRGS), a memory-efficient framework with hierarchical block-level optimization. First, we generate a global, coarse Gaussian representation from low-resolution data. Then, we partition the scene into multiple blocks, refining each block with high-resolution data. The partitioning involves two steps: Gaussian partitioning, where irregular scenes are normalized into a bounded cubic space with a uniform grid for task distribution, and training data partitioning, where only relevant observations are retained for each block. By guiding block refinement with the coarse Gaussian prior, we ensure seamless Gaussian fusion across adjacent blocks. To reduce computational demands, we introduce Importance-Driven Gaussian Pruning (IDGP), which computes importance scores for each Gaussian and removes those with minimal contribution, speeding up convergence and reducing memory usage. Additionally, we incorporate normal priors from a pretrained model to enhance surface reconstruction quality. Our method enables high-quality, high-resolution 3D scene reconstruction even under memory constraints. Extensive experiments on three benchmarks show that HRGS achieves state-of-the-art performance in high-resolution novel view synthesis (NVS) and surface reconstruction tasks.

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.13059 [pdf, ps, other]

Multipole Attention for Efficient Long Context Reasoning

Authors: Coleman Hooper, Sebastian Zhao, Luca Manolache, Sehoon Kim, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

Abstract: Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. While these models have attained high accuracy by leveraging additional computation at test time, they need to generate long chain-of-thought reasoning in order to think before answering, which requires generating thousands of tokens. While sparse attention methods can help reduce the KV cache pressure induced by this long autoregressive reasoning, these methods can introduce errors which disrupt the reasoning process. Additionally, prior methods often pre-process the input to make it easier to identify the important prompt tokens when computing attention during generation, and this pre-processing is challenging to perform online for newly generated reasoning tokens. Our work addresses these challenges by introducing Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens, while maintaining approximate representations for the remaining tokens. Our method first performs clustering to group together semantically similar key vectors, and then uses the cluster centroids both to identify important key vectors and to approximate the remaining key vectors in order to retain high accuracy. We design a fast cluster update process to quickly re-cluster the input and previously generated tokens, thereby allowing for accelerating attention to the previous output tokens.
We evaluate our method using emerging LRMs such as Qwen-8B, demonstrating that our approach can maintain accuracy on complex reasoning tasks even with aggressive attention sparsity settings. We also provide kernel implementations to demonstrate the practical efficiency gains from our method, achieving up to 4.5$\times$ speedup for attention in long-context reasoning applications. Our code is available at https://github.com/SqueezeAILab/MultipoleAttention.

Submitted 15 June, 2025; originally announced June 2025.

Comments: 15 pages

arXiv:2506.11244 [pdf, ps, other]

Iterative Multilingual Spectral Attribute Erasure

Authors: Shun Shao, Yftah Ziser, Zheng Zhao, Yifu Qiu, Shay B. Cohen, Anna Korhonen

Abstract: Multilingual representations embed words with similar meanings to share a common semantic space across languages, creating opportunities to transfer debiasing effects between languages. However, existing methods for debiasing are unable to exploit this opportunity because they operate on individual languages. We present Iterative Multilingual Spectral Attribute Erasure (IMSAE), which identifies and mitigates joint bias subspaces across multiple languages through iterative SVD-based truncation. Evaluating IMSAE across eight languages and five demographic dimensions, we demonstrate its effectiveness in both standard and zero-shot settings, where target language data is unavailable, but linguistically similar languages can be used for debiasing. Our comprehensive experiments across diverse language models (BERT, LLaMA, Mistral) show that IMSAE outperforms traditional monolingual and cross-lingual approaches while maintaining model utility.

Submitted 12 June, 2025; originally announced June 2025.

Comments: 8 pages, 3 figures

arXiv:2506.10741 [pdf, ps, other]

PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework

Authors: SiXiang Chen, Jianyu Lai, Jialin Gao, Tian Ye, Haoyu Chen, Hengyu Shi, Shitong Shao, Yunlong Lin, Song Fei, Zhaohu Xing, Yeying Jin, Junfeng Luo, Xiaoming Wei, Lei Zhu

Abstract: Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Evaluated in multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems. Our code, models, and datasets can be found on the project page: https://ephemeral182.github.io/PosterCraft

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.04544 [pdf, other]

hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation

Authors: Charles Hong, Brendan Roberts, Huijae An, Alex Um, Advay Ratan, Yakun Sophia Shao

Abstract: Large language models (LLMs) are playing an increasingly large role in domains such as code generation, including hardware code generation, where Verilog is the key language. However, the amount of publicly available Verilog code pales in comparison to the amount of code available for software languages like Python. In this work, we present hdl2v ("HDL-to-Verilog"), a dataset which seeks to increase the amount of available human-written Verilog data by translating or compiling three other hardware description languages - VHDL, Chisel, and PyMTL3 - to Verilog. Furthermore, we demonstrate the value of hdl2v in enhancing LLM Verilog generation by improving performance of a 32 billion-parameter open-weight model by up to 23% (pass@10) in VerilogEvalV2, without utilizing any data augmentation or knowledge distillation from larger models. We also show hdl2v's ability to boost the performance of a data augmentation-based fine-tuning approach by 63%. Finally, we characterize and analyze our dataset to better understand which characteristics of HDL-to-Verilog datasets can be expanded upon in future work for even better performance.

Submitted 4 June, 2025; originally announced June 2025.

arXiv:2506.04361 [pdf, ps, other]

ViT-based Local Volume dwarf galaxy Identification (VIDA) in the CSST survey

Authors: Han Qu, Zhen Yuan, Chengliang Wei, Chao Liu, Jiang Chang, Guoliang Li, Nicolas F. Martin, Chaowei Tsai, Shi Shao, Yu Luo, Ran Li, Xi Kang, Xiangxiang Xue, Zhou Fan

Abstract: Identifying dwarf galaxies within the Local Volume is crucial for constraining the luminosity function of satellite galaxies in the nearby universe. We report the detection capabilities of dwarf galaxies within the Local Volume using the Chinese Space Station Telescope (CSST). Based on the simulated imaging data of CSST, we develop a detection and classification pipeline that combines traditional image-based search techniques with advanced machine learning classification models. The simulated Local Volume dwarf galaxies can be identified using a pre-processing method for "extended source detection", followed by classification with a pretrained ViT-Base model. This pipeline achieves a true positive rate (TPR) exceeding 85% with a false positive rate (FPR) of only 0.1%. We quantify the detection completeness of Local Volume dwarf galaxies across a three-dimensional parameter space defined by absolute magnitude ($M_V$), half-light radius ($R_h$), and heliocentric distance, based on simulated single-exposure CSST wide-field imaging survey data. For unresolved or semi-resolved dwarf galaxies, our method achieves a significantly deeper absolute magnitude detection limit compared to catalog-based approaches, reaching $M_V = -7$ within 10 Mpc. By combining this image-based approach with traditional stellar catalog-based "matched filter" techniques, our automated framework established in this work can identify dwarf galaxies within 20 Mpc for the CSST mission.

Submitted 4 June, 2025; originally announced June 2025.

Comments: 16 pages, 18 figures

arXiv:2506.01942 [pdf, ps, other]

OD3: Optimization-free Dataset Distillation for Object Detection

Authors: Salwa K. Al Khatib, Ahmed ElHagry, Shitong Shao, Zhiqiang Shen

Abstract: Training large neural networks on large-scale datasets requires substantial computational resources, particularly for dense prediction tasks such as object detection. Although dataset distillation (DD) has been proposed to alleviate these demands by synthesizing compact datasets from larger ones, most existing work focuses solely on image classification, leaving the more complex detection setting largely unexplored. In this paper, we introduce OD3, a novel optimization-free data distillation framework specifically designed for object detection. Our approach involves two stages: first, a candidate selection process in which object instances are iteratively placed in synthesized images based on their suitable locations, and second, a candidate screening process using a pre-trained observer model to remove low-confidence objects. We apply our data synthesis framework to MS COCO and PASCAL VOC, two popular detection datasets, with compression ratios ranging from 0.25% to 5%. Compared to the only prior dataset distillation method for detection and to conventional core set selection methods, OD3 delivers superior accuracy and establishes new state-of-the-art results, surpassing the prior best method by more than 14% on COCO mAP50 at a compression ratio of 1.0%. Code and condensed datasets are available at: https://github.com/VILA-Lab/OD3.

Submitted 2 June, 2025; originally announced June 2025.

Comments: Equal Contribution of the first three authors

arXiv:2506.00618 [pdf, ps, other]

RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents

Authors: Jingyi Yang, Shuai Shao, Dongrui Liu, Jing Shao

Abstract: With the rapid development of multimodal large language models (MLLMs), they are increasingly deployed as autonomous computer-use agents capable of accomplishing complex computer tasks. However, a pressing issue arises: Can the safety risk principles designed and aligned for general MLLMs in dialogue scenarios be effectively transferred to real-world computer-use scenarios? Existing research on evaluating the safety risks of MLLM-based computer-use agents suffers from several limitations: it either lacks realistic interactive environments, or narrowly focuses on one or a few specific risk types. These limitations ignore the complexity, variability, and diversity of real-world environments, thereby restricting comprehensive risk evaluation for computer-use agents. To this end, we introduce RiOSWorld, a benchmark designed to evaluate the potential risks of MLLM-based agents during real-world computer manipulations. Our benchmark includes 492 risky tasks spanning various computer applications, involving web, social media, multimedia, OS, email, and office software. We categorize these risks into two major classes based on their risk source: (i) User-originated risks and (ii) Environmental risks. We evaluate safety risks from two perspectives: (i) Risk goal intention and (ii) Risk goal completion. Extensive experiments with multimodal agents on RiOSWorld demonstrate that current computer-use agents confront significant safety risks in real-world scenarios.
Our findings highlight the necessity and urgency of safety alignment for computer-use agents in real-world computer manipulation, providing valuable insights for developing trustworthy computer-use agents. Our benchmark is publicly available at https://yjyddq.github.io/RiOSWorld.github.io/.

Submitted 4 June, 2025; v1 submitted 31 May, 2025; originally announced June 2025.

Comments: 40 pages, 6 figures, Project Page: https://yjyddq.github.io/RiOSWorld.github.io/

arXiv:2505.22863 [pdf, other]

Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge

Authors: Yupei Li, Shuaijie Shao, Manuel Milling, Björn W. Schuller

Abstract: Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been used for recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rather than explicit text, relying on language alone is insufficient. Diagnostic accuracy also suffers without incorporating psychological expertise. To address these limitations, we present, to the best of our knowledge, the first application of LLMs to multimodal depression detection using the DAIC-WOZ dataset. We extract the audio features using the pre-trained model Wav2Vec and map them to text-based LLMs for further processing. We also propose a novel strategy for incorporating psychological knowledge into LLMs to enhance diagnostic performance, specifically using a question and answer set to grant authorised knowledge to LLMs. Our approach yields a notable improvement in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) compared to a base score proposed by the related original paper. The code is available at https://github.com/myxp-lyp/Depression-detection.git

Submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.18637 [pdf, ps, other]

Neural Coding Is Not Always Semantic: Towards The Standardized Coding Workflow in Semantic Communications

Authors: Hai-Long Qin, Jincheng Dai, Sixian Wang, Xiaoqi Qin, Shuo Shao, Kai Niu, Wenjun Xu, Ping Zhang

Abstract: Semantic communication, leveraging advanced deep learning techniques, emerges as a new paradigm that meets the requirements of next-generation wireless networks. However, current semantic communication systems, which employ neural coding for feature extraction from raw data, have not adequately addressed the fundamental question: Is general feature extraction through deep neural networks sufficient for understanding semantic meaning within raw data in semantic communication? This article is thus motivated to clarify two critical aspects: semantic understanding and general semantic representation. This article presents a standardized definition of semantic coding, an extensive neural coding scheme for general semantic representation that clearly represents underlying data semantics based on contextual modeling. With these general semantic representations obtained, both human- and machine-centric end-to-end data transmission can be achieved through only minimal specialized modifications, such as fine-tuning and regularization. This article contributes to establishing a common understanding that semantic communication extends far beyond mere feature transmission, focusing instead on conveying compact semantic representations through context-aware coding schemes.

Submitted 24 May, 2025; originally announced May 2025.

arXiv:2505.18574 [pdf, ps, other]

Autocomp: LLM-Driven Code Optimization for Tensor Accelerators

Authors: Charles Hong, Sahil Bhatia, Alvin Cheung, Yakun Sophia Shao

Abstract: Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages like specialized tensor accelerator code still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three categories of representative workloads and two different accelerators, we demonstrate that Autocomp-optimized code runs 5.6x (GEMM) and 2.7x (convolution) faster than the vendor-provided library, and outperforms expert-level hand-tuned code by 1.4x (GEMM), 1.1x (convolution), and 1.3x (fine-grained linear algebra).
Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.

Submitted 5 June, 2025; v1 submitted 24 May, 2025; originally announced May 2025.

arXiv:2505.16505 [pdf, ps, other]

Sparse Activation Editing for Reliable Instruction Following in Narratives

Authors: Runcong Zhao, Chengyu Cao, Qinglin Zhu, Xiucheng Lv, Shun Shao, Lin Gui, Ruifeng Xu, Yulan He

Abstract: Complex narrative contexts often challenge language models' ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To thoroughly evaluate our method, we introduce FreeInstruct, a diverse and realistic benchmark of 1,212 examples that highlights the challenges of instruction following in narrative-rich settings. While initially motivated by complex narratives, Concise-SAE demonstrates state-of-the-art instruction adherence across varied tasks without compromising generation quality.

Submitted 22 May, 2025; originally announced May 2025.

arXiv:2505.14205 [pdf, ps, other]

Structure theorems of commuting transformations and minimal $\mathbb{R}$-flows

Authors: Song Shao, Hui Xu

Abstract: In this paper, we develop several structure theorems concerning commuting transformations and minimal $\mathbb{R}$-flows. Specifically, we show that if $(X,S)$, $(X,T)$ are minimal systems with $S$ and $T$ being commutative, then they possess an identical higher-order regionally proximal relation. Consequently, both $(X, S)$ and $(X, T)$ share the same increasing sequence of pro-nilfactors. For minimal $\mathbb{R}$-flows, we introduce the concept of higher-order regionally proximal relations and nilfactors, and establish that nilfactors are characteristic factors for minimal $\mathbb{R}$-flows, up to almost one to one extensions.

Submitted 20 May, 2025; originally announced May 2025.

Comments: 41 pages

arXiv:2505.14135 [pdf, other]

Hunyuan-Game: Industrial-grade Intelligent Game Creation Model

Authors: Ruihuang Li, Caijin Zhou, Shoujian Zheng, Jianxiang Lu, Jiabin Huang, Comi Chen, Junshu Tang, Guangzheng Xu, Jiale Tao, Hongmei Wang, Donghao Li, Wenqing Yu, Senbo Wang, Zhimin Li, Yetshuan Shi, Haoyu Yang, Yukun Wang, Wenxun Dai, Jiaqi Li, Linqing Wang, Qixun Wang, Zhiyong Xu, Yingfang Zhang, Jiangfeng Xiong, Weijie Kong, et al. (33 additional authors not shown)

Abstract: Intelligent game creation represents a transformative advancement in game development, utilizing generative artificial intelligence to dynamically generate and enhance game content. Despite notable progress in generative models, the comprehensive synthesis of high-quality game assets, including both images and videos, remains a challenging frontier. To create high-fidelity game content that simultaneously aligns with player preferences and significantly boosts designer efficiency, we present Hunyuan-Game, an innovative project designed to revolutionize intelligent game production. Hunyuan-Game encompasses two primary branches: image generation and video generation. The image generation component is built upon a vast dataset comprising billions of game images, leading to the development of a group of customized image generation models tailored for game scenarios: (1) General Text-to-Image Generation. (2) Game Visual Effects Generation, involving text-to-effect and reference image-based game visual effect generation. (3) Transparent Image Generation for characters, scenes, and game visual effects. (4) Game Character Generation based on sketches, black-and-white images, and white models. The video generation component is built upon a comprehensive dataset of millions of game and anime videos, leading to the development of five core algorithmic models, each targeting critical pain points in game development and having robust adaptation to diverse game video scenarios: (1) Image-to-Video Generation. (2) 360 A/T Pose Avatar Video Synthesis.
(3) Dynamic Illustration Generation. (4) Generative Video Super-Resolution. (5) Interactive Game Video Generation. These image and video generation models not only exhibit high-level aesthetic expression but also deeply integrate domain-specific knowledge, establishing a systematic understanding of diverse game and anime art styles.

Submitted 28 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

arXiv:2505.11792 [pdf, ps, other]

Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling

Authors: Yitian Chen, Jingfan Xia, Siyu Shao, Dongdong Ge, Yinyu Ye

Abstract: Optimization modeling is fundamental to decision-making across diverse domains. Despite progress in automating optimization formulation from natural language descriptions, Large Language Models (LLMs) often struggle to generate formally correct and usable models against hallucinations, posing a challenge for reliable automation. Inspired by the success of Reinforcement Learning (RL) in enhancing Large Reasoning Models, we present Solver-Informed Reinforcement Learning (SIRL), a novel framework that significantly improves the authenticity of LLMs for optimization modeling using Reinforcement Learning with Verifiable Reward by leveraging external optimization solvers as verifiers. These verifiers automatically assess the executable code and the instance-level mathematical model represented by the associated LP file, yielding precise and comprehensive feedback signals, including syntax, feasibility, and solution quality, that serve as direct rewards for the RL process. This automated verification process, particularly from classic optimization solvers, also underpins our instance-enhanced self-consistency method to synthesize high-quality training data. Extensive experiments on diverse public benchmarks demonstrate that SIRL achieves state-of-the-art performance, substantially outperforming existing methods in generating accurate and executable optimization models. Our code is publicly available at https://github.com/Cardinal-Operations/SIRL.

Submitted 28 May, 2025; v1 submitted 16 May, 2025; originally announced May 2025.

arXiv:2505.09243 [pdf, ps, other]

The Shape and Mass of the Galactic Dark Matter Halo from the Axisymmetric Jeans Model

Authors: Lan Zhang, Xiang-Xiang Xue, Ling Zhu, Ruizhi Zhang, Chengqun Yang, Shi Shao, Jiang Chang, Feilu Wang, Hao Tian, Gang Zhao, Chao Liu

Abstract: We explore the density profile, shape, and virial mass of the Milky Way's dark matter halo using K giants (KG) from LAMOST and SDSS/SEGUE, as well as blue horizontal branch (BHB) stars from SDSS. Incorporating Gaia DR3 proper motions, we first investigate the velocity ellipsoid distribution within the $(R, |z|)$ space. The ellipsoids projected onto the $(v_R, v_z)$ plane exhibit near-spherical alignment. We then probe the underlying dark matter distribution using the axisymmetric Jeans equations with multi-Gaussian expansion (MGE) and the spherically aligned Jeans anisotropic modelling (JAM$_{\rm sph}$), allowing for different flattened dark matter density models. For each model, we apply two fitting approaches: fitting the KGs and BHBs separately, or fitting them simultaneously as two dynamical tracers in one gravitational potential. We find consistent results on the dark matter density profiles, $r_{200}$, and $M_{200}$ within a 1-$\sigma$ confidence region for models constrained by KGs, BHBs, and both. We find the strongest consistency between KGs and BHBs in constraining dark matter profiles for models incorporating radially varying halo flattening ($q(r_{\rm gc})$), which suggests the Milky Way's dark matter halo shape evolves with Galactocentric distance ($r_{\rm gc}$). Specifically, the halo flattening parameter $q_h$ decreases within $r_{\rm gc} < 20$ kpc and increases for $r_{\rm gc} > 20$ kpc.
In this model, $M_{\rm tot} (< 60~{\rm kpc}) = 0.533^{+0.061}_{-0.054} \times 10^{12}$ $M_{\odot}$, $r_{200}$ is $188\pm15$ kpc, with $M_{200}$ estimated at $0.820^{+0.210}_{-0.186} \times 10^{12} M_{\odot}$.

Submitted 14 May, 2025; originally announced May 2025.

Comments: 15 figures, 2 tables. Accepted for publication in ApJ

arXiv:2505.04684 [pdf, ps, other]

Parity anomaly from LSM: exact valley symmetries on the lattice

Authors: Salvatore D. Pace, Minho Luke Kim, Arkya Chatterjee, Shu-Heng Shao

Abstract: We show that the honeycomb tight-binding model hosts an exact microscopic avatar of its low-energy SU(2) valley symmetry and parity anomaly. Specifically, the SU(2) valley symmetry arises from a collection of conserved, integer-quantized charge operators that obey the Onsager algebra. Along with lattice reflection and time-reversal symmetries, this Onsager symmetry has a Lieb-Schultz-Mattis (LSM) anomaly that matches the parity anomaly in the IR. Indeed, we show that any local Hamiltonian commuting with these symmetries cannot have a trivial unique gapped ground state. We study the phase diagram of the simplest symmetric model and survey various deformations, including Haldane's mass term, which preserves only the Onsager symmetry. Our results place the parity anomaly in 2+1D alongside Schwinger's anomaly in 1+1D and Witten's SU(2) anomaly in 3+1D as 't Hooft anomalies that can arise from the Onsager symmetry on the lattice.

Submitted 7 May, 2025; originally announced May 2025.

Comments: 7 pages plus appendices

Report number: MIT-CTP/5869, YITP-SB-2025-10

arXiv:2504.21738 [pdf, ps, other]

LangWBC: Language-directed Humanoid Whole-Body Control via End-to-end Learning

Authors: Yiyang Shao, Xiaoyu Huang, Bike Zhang, Qiayuan Liao, Yuman Gao, Yufeng Chi, Zhongyu Li, Sophia Shao, Koushil Sreenath

Abstract: General-purpose humanoid robots are expected to interact intuitively with humans, enabling seamless integration into daily life. Natural language provides the most accessible medium for this purpose. However, translating language into humanoid whole-body motion remains a significant challenge, primarily due to the gap between linguistic understanding and physical actions. In this work, we present an end-to-end, language-directed policy for real-world humanoid whole-body control. Our approach combines reinforcement learning with policy distillation, allowing a single neural network to interpret language commands and execute corresponding physical actions directly. To enhance motion diversity and compositionality, we incorporate a Conditional Variational Autoencoder (CVAE) structure. The resulting policy achieves agile and versatile whole-body behaviors conditioned on language inputs, with smooth transitions between various motions, enabling adaptation to linguistic variations and the emergence of novel motions. We validate the efficacy and generalizability of our method through extensive simulations and real-world experiments, demonstrating robust whole-body control. Please see our website at LangWBC.github.io for more information.

Submitted 30 April, 2025; originally announced April 2025.

arXiv:2504.20706 [pdf, other]

Every 2-connected, cubic, planar graph with faces of size at most 6 is Hamiltonian

Authors: Sihong Shao, Yuxuan Wu

Abstract: We prove that every 2-connected, cubic, planar graph with faces of size at most 6 is Hamiltonian, and show that the 6-face condition is tight. Our results push the connectivity condition of the Barnette-Goodey conjecture to the weakest possible.

Submitted 29 April, 2025; originally announced April 2025.

arXiv:2504.17504 [pdf, ps, other]

On systems disjoint from all minimal systems

Authors: Wen Huang, Song Shao, Hui Xu, Xiangdong Ye

Abstract: Recently, Górska, Lemańczyk, and de la Rue characterized the class of automorphisms disjoint from all ergodic automorphisms. Inspired by their work, we provide several characterizations of systems that are disjoint from all minimal systems. For a topological dynamical system $(X,T)$, it is disjoint from all minimal systems if and only if there exist minimal subsets $(M_i)_{i\in\mathbb{N}}$ of $X$ whose union is dense in $X$ and each of them is disjoint from $X$ (we also provide a measure-theoretical analogy of the result). For a semi-simple system $(X,T)$, it is disjoint from all minimal systems if and only if there exists a dense $G_\delta$ set $\Omega$ in $X \times X$ such that for every pair $(x_1,x_2) \in \Omega$, the subsystems $\overline{\mathcal{O}}(x_1,T)$ and $\overline{\mathcal{O}}(x_2,T)$ are disjoint. Furthermore, for a general system a characterization similar to the ergodic case is obtained.

Submitted 24 April, 2025; originally announced April 2025.

Comments: 32 pages

arXiv:2504.17249 [pdf, other]

Demonstrating Berkeley Humanoid Lite: An Open-source, Accessible, and Customizable 3D-printed Humanoid Robot

Authors: Yufeng Chi, Qiayuan Liao, Junfeng Long, Xiaoyu Huang, Sophia Shao, Borivoje Nikolic, Zhongyu Li, Koushil Sreenath

Abstract: Despite significant interest and advancements in humanoid robotics, most existing commercially available hardware remains high-cost, closed-source, and non-transparent within the robotics community. This lack of accessibility and customization hinders the growth of the field and the broader development of humanoid technologies. To address these challenges and promote democratization in humanoid robotics, we demonstrate Berkeley Humanoid Lite, an open-source humanoid robot designed to be accessible, customizable, and beneficial for the entire community. The core of this design is a modular 3D-printed gearbox for the actuators and robot body. All components can be sourced from widely available e-commerce platforms and fabricated using standard desktop 3D printers, keeping the total hardware cost under $5,000 (based on U.S. market prices). The design emphasizes modularity and ease of fabrication. To address the inherent limitations of 3D-printed gearboxes, such as reduced strength and durability compared to metal alternatives, we adopted a cycloidal gear design, which provides an optimal form factor in this context. Extensive testing was conducted on the 3D-printed actuators to validate their durability and alleviate concerns about the reliability of plastic components. To demonstrate the capabilities of Berkeley Humanoid Lite, we conducted a series of experiments, including the development of a locomotion controller using reinforcement learning. These experiments successfully showcased zero-shot policy transfer from simulation to hardware, highlighting the platform's suitability for research validation. By fully open-sourcing the hardware design, embedded code, and training and deployment frameworks, we aim for Berkeley Humanoid Lite to serve as a pivotal step toward democratizing the development of humanoid robotics. All resources are available at https://lite.berkeley-humanoid.org.

Submitted 24 April, 2025; originally announced April 2025.

Comments: Accepted in Robotics: Science and Systems (RSS) 2025

arXiv:2504.16960 [pdf, other]

Can Knowledge Improve Security? A Coding-Enhanced Jamming Approach for Semantic Communication

Authors: Weixuan Chen, Qianqian Yang, Shuo Shao, Zhiguo Shi, Jiming Chen, Xuemin Shen

Abstract: As semantic communication (SemCom) attracts growing attention as a novel communication paradigm, ensuring the security of transmitted semantic information over open wireless channels has become a critical issue. However, traditional encryption methods often introduce significant additional communication overhead to maintain stability, and conventional learning-based secure SemCom methods typically rely on a channel capacity advantage for the legitimate receiver, which is challenging to guarantee in real-world scenarios. In this paper, we propose a coding-enhanced jamming method that eliminates the need to transmit a secret key by utilizing shared knowledge (potentially part of the training set of the SemCom system) between the legitimate receiver and the transmitter. Specifically, we leverage the shared private knowledge base to generate a set of private digital codebooks in advance using neural network (NN)-based encoders. For each transmission, we encode the transmitted data into a digital sequence Y1 and associate Y1 with a sequence randomly picked from the private codebook, denoted as Y2, through superposition coding. Here, Y1 serves as the outer code and Y2 as the inner code. By optimizing the power allocation between the inner and outer codes, the legitimate receiver can reconstruct the transmitted data using successive decoding with the index of Y2 shared, while the eavesdropper's decoding performance is severely degraded, potentially to the point of random guessing. Experimental results demonstrate that our method achieves comparable security to state-of-the-art approaches while significantly improving the reconstruction performance of the legitimate receiver by more than 1 dB across varying channel signal-to-noise ratios (SNRs) and compression ratios.

Submitted 6 May, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

arXiv:2504.14152 [pdf, ps, other]

FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference

Authors: Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Sophia Shao, Brucek Khailany

Abstract: Quantization is a powerful tool to improve large language model (LLM) inference efficiency by utilizing more energy-efficient low-precision datapaths and reducing memory footprint. However, accurately quantizing LLM weights and activations to low precision is challenging without degrading model accuracy. We propose fine-grained mixed precision (FGMP) quantization, a post-training mixed-precision quantization hardware-software co-design methodology that maintains accuracy while quantizing the majority of weights and activations to reduced precision. Our work makes the following contributions: 1) We develop a policy that uses the perturbation in each value, weighted by the Fisher information, to select which weight and activation blocks to keep in higher precision. This approach preserves accuracy by identifying which weight and activation blocks need to be retained in higher precision to minimize the perturbation in the model loss. 2) We also propose a sensitivity-weighted clipping approach for fine-grained quantization which helps retain accuracy for blocks that are quantized to low precision. 3) We then propose hardware augmentations to leverage the efficiency benefits of FGMP quantization. Our hardware implementation encompasses i) datapath support for FGMP at block granularity, and ii) a mixed-precision activation quantization unit to assign activation blocks to high or low precision on the fly with minimal runtime and energy overhead. Our design, prototyped using NVFP4 (an FP4 format with microscaling) as the low-precision datatype and FP8 as the high-precision datatype, facilitates efficient FGMP quantization, attaining <1% perplexity degradation on Wikitext-103 for the Llama-2-7B model relative to an all-FP8 baseline design while consuming 14% less energy during inference and requiring 30% less weight memory.

Submitted 18 April, 2025; originally announced April 2025.

arXiv:2504.13151 [pdf, ps, other]

MIB: A Mechanistic Interpretability Benchmark

Authors: Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov

Abstract: How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or causal variables in neural language models. The circuit localization track compares methods that locate the model components - and connections between them - most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and align those features to a task-relevant causal variable. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., non-featurized hidden vectors. These findings illustrate that MIB enables meaningful comparisons, and increases our confidence that there has been real progress in the field.

Submitted 9 June, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

Comments: Accepted to ICML 2025. Project website at https://mib-bench.github.io

arXiv:2504.01570 [pdf, other]

Density estimation via mixture discrepancy and moments

Authors: Zhengyang Lei, Sihong Shao

Abstract: With the aim of generalizing histogram statistics to higher dimensional cases, density estimation via discrepancy based sequential partition (DSP) has been proposed [D. Li, K. Yang, W. Wong, Advances in Neural Information Processing Systems (2016) 1099-1107] to learn an adaptive piecewise constant approximation defined on a binary sequential partition of the underlying domain, where the star discrepancy is adopted to measure the uniformity of particle distribution. However, the calculation of the star discrepancy is NP-hard and it does not satisfy the reflection invariance and rotation invariance either. To this end, we use the mixture discrepancy and the comparison of moments as a replacement of the star discrepancy, leading to the density estimation via mixture discrepancy based sequential partition (DSP-mix) and density estimation via moments based sequential partition (MSP), respectively. Both DSP-mix and MSP are computationally tractable and exhibit the reflection and rotation invariance. Numerical experiments in reconstructing the $d$-D mixture of Gaussians and Betas with $d=2, 3, \dots, 6$ demonstrate that DSP-mix and MSP both run approximately ten times faster than DSP while maintaining the same accuracy.

Submitted 2 April, 2025; originally announced April 2025.

arXiv:2504.00587 [pdf, ps, other]

AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems

Authors: Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, Weinan Zhang

Abstract: The rapid advancement of large language models (LLMs) has enabled the development of multi-agent systems where multiple LLM-based agents collaborate on complex tasks. However, existing systems often rely on centralized coordination, leading to scalability bottlenecks, reduced adaptability, and single points of failure. Privacy and proprietary knowledge concerns further hinder cross-organizational collaboration, resulting in siloed expertise. We propose AgentNet, a decentralized, Retrieval-Augmented Generation (RAG)-based framework that enables LLM-based agents to specialize, evolve, and collaborate autonomously in a dynamically structured Directed Acyclic Graph (DAG). Unlike prior approaches with static roles or centralized control, AgentNet allows agents to adjust connectivity and route tasks based on local expertise and context. AgentNet introduces three key innovations: (1) a fully decentralized coordination mechanism that eliminates the need for a central orchestrator, enhancing robustness and emergent intelligence; (2) dynamic agent graph topology that adapts in real time to task demands, ensuring scalability and resilience; and (3) a retrieval-based memory system for agents that supports continual skill refinement and specialization. By minimizing centralized control and data exchange, AgentNet enables fault-tolerant, privacy-preserving collaboration across organizations. Experiments show that AgentNet achieves higher task accuracy than both single-agent and centralized multi-agent baselines.

Submitted 29 May, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

arXiv:2503.20863 [pdf, other]

Additivity, Haag duality, and non-invertible symmetries

Authors: Shu-Heng Shao, Jonathan Sorce, Manu Srivastava

Abstract: The algebraic approach to quantum field theory focuses on the properties of local algebras, whereas the study of (possibly non-invertible) global symmetries emphasizes global aspects of the theory and spacetime. We study connections between these two perspectives by examining how either of two core algebraic properties -- "additivity" or "Haag duality" -- is violated in a 1+1D CFT or lattice model restricted to the symmetric sector of a general global symmetry. For the Verlinde symmetry of a bosonic diagonal RCFT, we find that additivity is violated whenever the symmetry algebra contains an invertible element, while Haag duality is violated whenever it contains a non-invertible element. We find similar phenomena for the Kramers-Wannier and Rep(D$_8$) non-invertible symmetries on spin chains.

Submitted 26 March, 2025; originally announced March 2025.

Comments: 22 pages

Report number: MIT-CTP/5853, YITP-SB-2025-06

arXiv:2503.20211 [pdf, other]

Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors

Authors: Weilong Yan, Ming Li, Haipeng Li, Shuwei Shao, Robby T. Tan

Abstract: Self-supervised depth estimation from monocular cameras in diverse outdoor conditions, such as daytime, rain, and nighttime, is challenging due to the difficulty of learning universal representations and the severe lack of labeled real-world adverse data. Previous methods either rely on synthetic inputs and pseudo-depth labels or directly apply daytime strategies to adverse conditions, leading to suboptimal results. In this paper, we present the first synthetic-to-real robust depth estimation framework, incorporating motion and structure priors to capture real-world knowledge effectively. In the synthetic adaptation, we transfer motion-structure knowledge inside cost volumes for a more robust representation, using a frozen daytime model to train a depth estimator in synthetic adverse conditions. In the real adaptation, which aims to close synthetic-to-real gaps, models trained earlier identify the weather-insensitive regions with a designed consistency-reweighting strategy to emphasize valid pseudo-labels. We introduce a new regularization by gathering explicit depth distributions to constrain the model when facing real-world data. Experiments show that our method outperforms the state-of-the-art across diverse conditions in multi-frame and single-frame evaluations. We achieve improvements of 7.5% and 4.3% in AbsRel and RMSE on average for nuScenes and Robotcar datasets (daytime, nighttime, rain). In zero-shot evaluation of DrivingStereo (rain, fog), our method generalizes better than the previous ones.

Submitted 26 March, 2025; originally announced March 2025.

arXiv:2503.13319 [pdf, other]

MagicDistillation: Weak-to-Strong Video Distillation for Large-Scale Few-Step Synthesis

Authors: Shitong Shao, Hongwei Yi, Hanzhong Guo, Tian Ye, Daquan Zhou, Michael Lingelbach, Zhiqiang Xu, Zeke Xie

Abstract: Recently, open-source video diffusion models (VDMs), such as WanX, Magic141 and HunyuanVideo, have been scaled to over 10 billion parameters. These large-scale VDMs have demonstrated significant improvements over smaller-scale VDMs across multiple dimensions, including enhanced visual quality and more natural motion dynamics. However, these models face two major limitations: (1) High inference overhead: Large-scale VDMs require approximately 10 minutes to synthesize a 28-step video on a single H100 GPU. (2) Limited in portrait video synthesis: Models like WanX-I2V and HunyuanVideo-I2V often produce unnatural facial expressions and movements in portrait videos. To address these challenges, we propose MagicDistillation, a novel framework designed to reduce inference overhead while ensuring the generalization of VDMs for portrait video synthesis. Specifically, we primarily use sufficiently high-quality talking video to fine-tune Magic141, which is dedicated to portrait video synthesis. We then employ LoRA to effectively and efficiently fine-tune the fake DiT within the step distillation framework known as distribution matching distillation (DMD). Following this, we apply weak-to-strong (W2S) distribution matching and minimize the discrepancy between the fake data distribution and the ground truth distribution, thereby improving the visual fidelity and motion dynamics of the synthesized videos. Experimental results on portrait video synthesis demonstrate the effectiveness of MagicDistillation, as our method surpasses Euler, LCM, and DMD baselines in both FID/FVD metrics and VBench. Moreover, MagicDistillation, requiring only 4 steps, also outperforms WanX-I2V (14B) and HunyuanVideo-I2V (13B) on visualization and VBench. Our project page is https://magicdistillation.github.io/MagicDistillation/.

Submitted 31 March, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

arXiv:2503.12387 [pdf, other]

M2UD: A Multi-model, Multi-scenario, Uneven-terrain Dataset for Ground Robot with Localization and Mapping Evaluation

Authors: Yanpeng Jia, Shiyi Wang, Shiliang Shao, Yue Wang, Fu Zhang, Ting Wang

Abstract: Ground robots play a crucial role in inspection, exploration, rescue, and other applications. In recent years, advancements in LiDAR technology have made sensors more accurate, lightweight, and cost-effective. Therefore, researchers increasingly integrate sensors for SLAM studies, providing robust technical support for ground robots and expanding their application domains. Public datasets are essential for advancing SLAM technology. However, existing datasets for ground robots are typically restricted to flat-terrain motion with 3 DOF and cover only a limited range of scenarios. Although handheld devices and UAVs exhibit richer and more aggressive movements, their datasets are predominantly confined to small-scale environments due to endurance limitations. To fill these gaps, we introduce M2UD, a multi-modal, multi-scenario, uneven-terrain SLAM dataset for ground robots. This dataset contains a diverse range of highly challenging environments, including cities, open fields, long corridors, and mixed scenarios. Additionally, it presents extreme weather conditions. The aggressive motion and degradation characteristics of this dataset not only pose challenges for testing and evaluating existing SLAM methods but also advance the development of more advanced SLAM algorithms. To benchmark SLAM algorithms, M2UD provides smoothed ground truth localization data obtained via RTK and introduces a novel localization evaluation metric that considers both accuracy and efficiency. Additionally, we utilize a high-precision laser scanner to acquire ground truth maps of two representative scenes, facilitating the development and evaluation of mapping algorithms. We select 12 localization sequences and 2 mapping sequences to evaluate several classical SLAM algorithms, verifying the usability of the dataset. To enhance usability, the dataset is accompanied by a suite of development kits.

Submitted 16 March, 2025; originally announced March 2025.

Comments: 18 pages, 12 figures

arXiv:2503.11254 [pdf, ps, other]

A scalable sequential adaptive cubic regularization algorithm for optimization with general equality constraints

Authors: Yonggang Pei, Shuai Shao, Mauricio Silva Louzeiro, Detong Zhu

Abstract: The scalable adaptive cubic regularization method ($\mathrm{ARC_{q}K}$: Dussault et al. in Math. Program. Ser. A 207(1-2):191-225, 2024) has been recently proposed for unconstrained optimization. It has excellent convergence properties, complexity, and promising numerical performance. In this paper, we extend $\mathrm{ARC_{q}K}$ to large scale nonlinear optimization with general equality constraints and propose a scalable sequential adaptive cubic regularization algorithm named $\mathrm{SSARC_{q}K}$. In each iteration, we construct an ARC subproblem with linearized constraints inspired by sequential quadratic optimization methods. Then a composite-step approach is used to decompose the trial step into the sum of the vertical step and the horizontal step. By means of a reduced-Hessian approach, we rewrite the linearly constrained ARC subproblem as a standard unconstrained ARC subproblem to compute the horizontal step. A CG-Lanczos procedure with shifts is employed to solve this subproblem approximately. We provide a new global convergence analysis of the inexact ARC method. Preliminary numerical results are reported to show the performance of $\mathrm{SSARC_{q}K}$.

Submitted 27 March, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

arXiv:2503.09662 [pdf, other]

CoRe^2: Collect, Reflect and Refine to Generate Better and Faster

Authors: Shitong Shao, Zikai Zhou, Dian Xie, Yuetong Fang, Tian Ye, Lichen Bai, Zeke Xie

Abstract: Making text-to-image (T2I) generative model sample both fast and well represents a promising research direction. Previous studies have typically focused on either enhancing the visual quality of synthesized images at the expense of sampling efficiency or dramatically accelerating sampling without improving the base model's generative capacity. Moreover, nearly all inference methods have not been a… ▽ More Making text-to-image (T2I) generative model sample both fast and well represents a promising research direction. Previous studies have typically focused on either enhancing the visual quality of synthesized images at the expense of sampling efficiency or dramatically accelerating sampling without improving the base model's generative capacity. Moreover, nearly all inference methods have not been able to ensure stable performance simultaneously on both diffusion models (DMs) and visual autoregressive models (ARMs). In this paper, we introduce a novel plug-and-play inference paradigm, CoRe^2, which comprises three subprocesses: Collect, Reflect, and Refine. CoRe^2 first collects classifier-free guidance (CFG) trajectories, and then use collected data to train a weak model that reflects the easy-to-learn contents while reducing number of function evaluations during inference by half. Subsequently, CoRe^2 employs weak-to-strong guidance to refine the conditional output, thereby improving the model's capacity to generate high-frequency and realistic content, which is difficult for the base model to capture. To the best of our knowledge, CoRe^2 is the first to demonstrate both efficiency and effectiveness across a wide range of DMs, including SDXL, SD3.5, and FLUX, as well as ARMs like LlamaGen. It has exhibited significant performance improvements on HPD v2, Pick-of-Pic, Drawbench, GenEval, and T2I-Compbench. 
Furthermore, CoRe^2 can be seamlessly integrated with the state-of-the-art Z-Sampling, outperforming it by 0.3 and 0.16 on PickScore and AES, while achieving a 5.64s time saving using SD3.5. Code is released at https://github.com/xie-lab-ml/CoRe/tree/main.

Submitted 12 March, 2025; originally announced March 2025.

arXiv:2503.05978 [pdf, other]

MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice

Authors: Hongwei Yi, Tian Ye, Shitong Shao, Xuancheng Yang, Jiantong Zhao, Hanzhong Guo, Terrance Wang, Qingyu Yin, Zeke Xie, Lei Zhu, Wei Li, Michael Lingelbach, Daquan Zhou

Abstract: We present MagicInfinite, a novel diffusion Transformer (DiT) framework that overcomes traditional portrait animation limitations, delivering high-fidelity results across diverse character types: realistic humans, full-body figures, and stylized anime characters. It supports varied facial poses, including back-facing views, and animates single or multiple characters with input masks for precise speaker designation in multi-character scenes. Our approach tackles key challenges with three innovations: (1) 3D full-attention mechanisms with a sliding window denoising strategy, enabling infinite video generation with temporal coherence and visual quality across diverse character styles; (2) a two-stage curriculum learning scheme, integrating audio for lip sync, text for expressive dynamics, and reference images for identity preservation, enabling flexible multi-modal control over long sequences; and (3) region-specific masks with adaptive loss functions to balance global textual control and local audio guidance, supporting speaker-specific animations. Efficiency is enhanced via our innovative unified step and CFG distillation techniques, achieving a 20x inference speed boost over the base model: generating a 10-second 540x540p video in 10 seconds or 720x720p in 30 seconds on 8 H100 GPUs, without quality loss. Evaluations on our new benchmark demonstrate MagicInfinite's superiority in audio-lip synchronization, identity preservation, and motion naturalness across diverse scenarios.
It is publicly available at https://www.hedra.com/, with examples at https://magicinfinite.github.io/.

Submitted 7 March, 2025; originally announced March 2025.

Comments: MagicInfinite is publicly accessible at https://www.hedra.com/. More examples are at https://magicinfinite.github.io/

arXiv:2503.05794 [pdf, other]

CBW: Towards Dataset Ownership Verification for Speaker Verification via Clustering-based Backdoor Watermarking

Authors: Yiming Li, Kaiying Yan, Shuo Shao, Tongqing Zhai, Shu-Tao Xia, Zhan Qin, Dacheng Tao

Abstract: With the increasing adoption of deep learning in speaker verification, large-scale speech datasets have become valuable intellectual property. To audit and prevent the unauthorized usage of these valuable released datasets, especially in commercial or open-source scenarios, we propose a novel dataset ownership verification method. Our approach introduces a clustering-based backdoor watermark (CBW), enabling dataset owners to determine whether a suspicious third-party model has been trained on a protected dataset under a black-box setting. The CBW method consists of two key stages: dataset watermarking and ownership verification. During watermarking, we implant multiple trigger patterns in the dataset to make similar samples (measured by their feature similarities) close to the same trigger while dissimilar samples are near different triggers. This ensures that any model trained on the watermarked dataset exhibits specific misclassification behaviors when exposed to trigger-embedded inputs. To verify dataset ownership, we design a hypothesis-test-based framework that statistically evaluates whether a suspicious model exhibits the expected backdoor behavior. We conduct extensive experiments on benchmark datasets, verifying the effectiveness and robustness of our method against potential adaptive attacks. The code for reproducing main experiments is available at https://github.com/Radiant0726/CBW

Submitted 5 April, 2025; v1 submitted 1 March, 2025; originally announced March 2025.

Comments: 14 pages. The journal extension of our ICASSP'21 paper (arXiv:2010.11607)

arXiv:2503.02925 [pdf, other]

Gauging non-invertible symmetries on the lattice

Authors: Sahand Seifnashri, Shu-Heng Shao, Xinping Yang

Abstract: We provide a general prescription for gauging finite non-invertible symmetries in 1+1d lattice Hamiltonian systems. Our primary example is the Rep(D$_8$) fusion category generated by the Kennedy-Tasaki transformation, which is the simplest anomaly-free non-invertible symmetry on a spin chain of qubits. We explicitly compute its lattice F-symbols and illustrate our prescription for a particular (non-maximal) gauging of this symmetry. In our gauging procedure, we introduce two qubits around each link, playing the role of "gauge fields" for the non-invertible symmetry, and impose novel Gauss's laws. Similar to the Kramers-Wannier transformation for gauging an ordinary $\mathbb{Z}_2$, our gauging can be summarized by a gauging map, which is part of a larger, continuous non-invertible cosine symmetry.

Submitted 4 March, 2025; originally announced March 2025.

Comments: 66 pages, 1 figure, 1 table

Report number: MIT-CTP/5842, YITP-SB-2025-03

arXiv:2502.19628 [pdf, other]

doi 10.1145/3701716.3715589

PCL: Prompt-based Continual Learning for User Modeling in Recommender Systems

Authors: Mingdai Yang, Fan Yang, Yanhui Guo, Shaoyuan Xu, Tianchen Zhou, Yetian Chen, Simone Shao, Jia Liu, Yan Gao

Abstract: User modeling in large e-commerce platforms aims to optimize user experiences by incorporating various customer activities. Traditional models targeting a single task often focus on specific business metrics, neglecting the comprehensive user behavior, and thus limiting their effectiveness. To develop more generalized user representations, some existing work adopts Multi-task Learning (MTL) approaches. But they all face the challenges of optimization imbalance and inefficiency in adapting to new tasks. Continual Learning (CL), which allows models to learn new tasks incrementally and independently, has emerged as a solution to MTL's limitations. However, CL faces the challenge of catastrophic forgetting, where previously learned knowledge is lost when the model is learning the new task. Inspired by the success of prompt tuning in Pretrained Language Models (PLMs), we propose PCL, a Prompt-based Continual Learning framework for user modeling, which utilizes position-wise prompts as external memory for each task, preserving knowledge and mitigating catastrophic forgetting. Additionally, we design contextual prompts to capture and leverage inter-task relationships during prompt tuning. We conduct extensive experiments on real-world datasets to demonstrate PCL's effectiveness.

Submitted 26 February, 2025; originally announced February 2025.

Comments: 5 pages. Accepted by WWW'25 as a short paper

arXiv:2502.18508 [pdf, other]

REFINE: Inversion-Free Backdoor Defense via Model Reprogramming

Authors: Yukun Chen, Shuo Shao, Enhao Huang, Yiming Li, Pin-Yu Chen, Zhan Qin, Kui Ren

Abstract: Backdoor attacks on deep neural networks (DNNs) have emerged as a significant security threat, allowing adversaries to implant hidden malicious behaviors during the model training phase. Pre-processing-based defense, which is one of the most important defense paradigms, typically focuses on input transformations or backdoor trigger inversion (BTI) to deactivate or eliminate embedded backdoor triggers during the inference process. However, these methods suffer from inherent limitations: transformation-based defenses often fail to balance model utility and defense performance, while BTI-based defenses struggle to accurately reconstruct trigger patterns without prior knowledge. In this paper, we propose REFINE, an inversion-free backdoor defense method based on model reprogramming. REFINE consists of two key components: (1) an input transformation module that disrupts both benign and backdoor patterns, generating new benign features; and (2) an output remapping module that redefines the model's output domain to guide the input transformations effectively. By further integrating supervised contrastive loss, REFINE enhances the defense capabilities while maintaining model utility. Extensive experiments on various benchmark datasets demonstrate the effectiveness of our REFINE and its resistance to potential adaptive attacks.

Submitted 22 February, 2025; originally announced February 2025.

Comments: This paper is accepted by ICLR 2025. The first two authors contributed equally to this work. Our code is available at BackdoorBox (https://github.com/THUYimingLi/BackdoorBox) and Github repository (https://github.com/WhitolfChen/REFINE). 28 pages

arXiv:2502.17088 [pdf, other]

Where are the earliest stars relics in the simulated Milky Way analogues?

Authors: Hang Yang, Liang Gao, Qi Guo, Haining Li, Shi Shao, Gang Zhao

Abstract: Using 6 Milky Way analogues with two different numerical resolutions from the Auriga simulation, we investigate the total mass, spatial distribution and kinematics of the earliest stars relics in the Milky Way at $z=0$. These relics (second generation stars) formed over a wide redshift range, from about $z=22$ to $z=4$, with an average formation redshift of $z \sim 10.0$, and comprise about $2\times10^{-5}$ of the entire galactic stellar population. The disk and bulge components host only a small fraction of these relics, contributing less than $12$ percent in total. The stellar halo, in particular the outer stellar halo at galactic radius $r>30$ kpc, hosts the largest fraction (about 46 percent on average), with an average of one relic star per $4,000$ to $10,000$ stars, making it a promising region for observational searches. Additionally, around $18$ percent of the earliest stars relics are found in satellite galaxies, with smaller and older satellite galaxies tending to contain a higher proportion of these stars. Thus, low-mass and early-formed satellite galaxies are also ideal targets for finding such relics, although some satellite galaxies may lack them entirely. The spatial distribution and kinematics of these stars show good numerical convergence across different simulation resolutions. Our results provide valuable guidance for searches of the earliest stars relics and offer insights for interpreting findings from ongoing and future stellar archaeology surveys.

Submitted 24 February, 2025; originally announced February 2025.

Comments: 9 pages, 6 figures. Submitted to ApJ

arXiv:2502.13575 [pdf, ps, other]

ETS: Efficient Tree Search for Inference-Time Scaling

Authors: Coleman Hooper, Sehoon Kim, Suhong Moon, Kerem Dilmen, Monishwaran Maheswaran, Nicholas Lee, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, Amir Gholami

Abstract: Test-time compute scaling has emerged as a new axis along which to improve model accuracy, where additional computation is used at inference time to allow the model to think longer for more challenging problems. One promising approach for test-time compute scaling is search against a process reward model, where a model generates multiple potential candidates at each step of the search, and these partial trajectories are then scored by a separate reward model in order to guide the search process. The diversity of trajectories in the tree search process affects the accuracy of the search, since increasing diversity promotes more exploration. However, this diversity comes at a cost, as divergent trajectories have less KV sharing, which means they consume more memory and slow down the search process. Previous search methods either do not perform sufficient exploration, or else explore diverse trajectories but have high latency. We address this challenge by proposing Efficient Tree Search (ETS), which promotes KV sharing by pruning redundant trajectories while maintaining necessary diverse trajectories. ETS incorporates a linear programming cost model to promote KV cache sharing by penalizing the number of nodes retained, while incorporating a semantic coverage term into the cost model to ensure that we retain trajectories which are semantically different.
We demonstrate how ETS can achieve 1.8$\times$ reduction in average KV cache size during the search process, leading to 1.4$\times$ increased throughput relative to prior state-of-the-art methods, with minimal accuracy degradation and without requiring any custom kernel implementation. Code is available at: https://github.com/SqueezeAILab/ETS.

Submitted 11 June, 2025; v1 submitted 19 February, 2025; originally announced February 2025.

Comments: 15 pages

arXiv:2502.08537 [pdf]

doi 10.1038/s41467-025-58262-y

Broken symmetries associated with a Kagome chiral charge order

Authors: Zi-Jia Cheng, Md Shafayat Hossain, Qi Zhang, Sen Shao, Jinjin Liu, Yilin Zhao, Mohammad Yahyavi, Yu-Xiao Jiang, Jia-Xin Yin, Xian Yang, Yongkai Li, Tyler A. Cochran, Maksim Litskevich, Byunghoon Kim, Junyi Zhang, Yugui Yao, Luis Balicas, Zhiwei Wang, Guoqing Chang, M. Zahid Hasan

Abstract: Chirality or handedness manifests in all fields of science, ranging from cell biology, molecular interaction, and catalysis to different branches of physics. In condensed matter physics, chirality is intrinsic to enigmatic quantum phases, such as chiral charge density waves and chiral superconductivity. Here, the underlying chiral response is subtle and leads to broken symmetries in the ground state. Detection of subtle broken symmetries is the key to understanding these quantum states, but they are extremely challenging to expose, leading to debate and controversy. Here, using second-order optical response, we uncover the broken symmetries of a chiral charge density wave in the Kagome lattice KV3Sb5, revealing the relevant broken symmetries of its charge order. KV3Sb5 undergoes a phase transition to a charge-ordered state at low temperatures. Our polarization-dependent mid-infrared photocurrent microscopy reveals an intrinsic, longitudinal helicity-dependent photocurrent associated with the charge order. Our measurements, supported by our theoretical analysis, provide direct evidence for broken inversion and mirror symmetries at the charge order transition, indicating a chiral charge ordered state. On the other hand, we do not observe a circular photogalvanic effect along the direction perpendicular to that of the incident light, imposing stringent constraints on the rotational and point group symmetries of the charge order.
Our study not only visualizes the chiral nature of the Kagome charge order revealing its broken symmetries, but also highlights the nonlinear photogalvanic effect as a sensitive probe for detecting subtle symmetry breakings.

Submitted 12 February, 2025; originally announced February 2025.

Comments: in press

Journal ref: Nature Communications volume 16, Article number: 3782 (2025)

arXiv:2502.08048 [pdf, other]

Efficiently Laser Driven Terahertz Surface Plasmon Polaritons on Long Metal Wire

Authors: Shuoting Shao, Xiangbing Wang, Rong Huang, Guangyue Hu, Min Chen, Huibo Tang, Longyu Kuang, Yuxi Liu, Yuqiu Gu, Yongkun Ding, Ruxin Li, Hongbin Zhuo, Mingyang Yu

Abstract: We experimentally demonstrate a novel scheme for efficiently generating intense terahertz (THz) surface plasmon polaritons (SPPs) on a sub-wavelength-diameter meter-long metal wire. Driven by a subrelativistic femtosecond laser (a0=0.3, 3 mJ) focused at the wire's midpoint, single-cycle ten-megawatt THz SPPs are excited and propagate bidirectionally along it over 25 cm. The measured laser-to-SPPs energy conversion efficiency reaches up to ~2.4%, which is the highest value at present. It is proved that the THz SPPs are excited by coherent transition radiation of the escaping electrons produced by the subrelativistic laser. Particle-in-cell together with CST simulations confirm the experimental observations. Our scheme of using a readily available subrelativistic laser should thus be useful to applications requiring terawatt level single-cycle THz SPPs.

Submitted 21 February, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

arXiv:2502.07701

Magic 1-For-1: Generating One Minute Video Clips within One Minute

Authors: Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou

Abstract: In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that with the same optimization algorithm, the image-to-video task indeed converges more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training the image-to-video (I2V) models from three aspects: 1) model convergence speedup by using a multi-modal prior condition injection; 2) inference latency speedup by applying an adversarial step distillation; and 3) inference memory cost optimization with parameter sparsification. With those techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than 1 second for generating 1 second video clips on average. We conduct a series of preliminary explorations to find out the optimal tradeoff between computational cost and video quality during diffusion step distillation and hope this could be a good foundation model for open-source explorations. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.

Submitted 16 February, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

Comments: Serious updates are needed

arXiv:2502.07644 [pdf, other]

SymGPT: Auditing Smart Contracts via Combining Symbolic Execution with Large Language Models

Authors: Shihao Xia, Mengting He, Shuai Shao, Tingting Yu, Yiying Zhang, Linhai Song

Abstract: To govern smart contracts running on Ethereum, multiple Ethereum Request for Comment (ERC) standards have been developed, each having a set of rules to guide the behaviors of smart contracts. Violating the ERC rules could cause serious security issues and financial loss, signifying the importance of verifying that smart contracts follow ERCs. Today's practices of such verification are to manually audit each single contract, use expert-developed program-analysis tools, or use large language models (LLMs), all of which are far from effective in identifying ERC rule violations. This paper introduces SymGPT, a tool that combines the natural language understanding of large language models (LLMs) with the formal guarantees of symbolic execution to automatically verify smart contracts' compliance with ERC rules. To develop SymGPT, we conduct an empirical study of 132 ERC rules from three widely used ERC standards, examining their content, security implications, and natural language descriptions. Based on this study, we design SymGPT by first instructing an LLM to translate ERC rules into a defined EBNF grammar. We then synthesize constraints from the formalized rules to represent scenarios where violations may occur and use symbolic execution to detect them. Our evaluation shows that SymGPT identifies 5,783 ERC rule violations in 4,000 real-world contracts, including 1,375 violations with clear attack paths for stealing financial assets, demonstrating its effectiveness.
Furthermore, SymGPT outperforms six automated techniques and a security-expert auditing service, underscoring its superiority over current smart contract analysis methods.

Submitted 12 February, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

Comments: 16 pages. arXiv admin note: text overlap with arXiv:2404.04306

arXiv:2501.15509 [pdf, other]

FIT-Print: Towards False-claim-resistant Model Ownership Verification via Targeted Fingerprint

Authors: Shuo Shao, Haozhe Zhu, Hongwei Yao, Yiming Li, Tianwei Zhang, Zhan Qin, Kui Ren

Abstract: Model fingerprinting is a widely adopted approach to safeguard the copyright of open-source models by detecting and preventing their unauthorized reuse without modifying the protected model. However, in this paper, we reveal that existing fingerprinting methods are vulnerable to false claim attacks where adversaries falsely assert ownership of third-party non-reused models. We find that this vulnerability mostly stems from their untargeted nature, where they generally compare the outputs of given samples on different models instead of the similarities to specific references. Motivated by this finding, we propose a targeted fingerprinting paradigm (i.e., FIT-Print) to counteract false claim attacks. Specifically, FIT-Print transforms the fingerprint into a targeted signature via optimization. Building on the principles of FIT-Print, we develop bit-wise and list-wise black-box model fingerprinting methods, i.e., FIT-ModelDiff and FIT-LIME, which exploit the distance between model outputs and the feature attribution of specific samples as the fingerprint, respectively. Experiments on benchmark models and datasets verify the effectiveness, conferrability, and resistance to false claim attacks of our FIT-Print.

Submitted 23 May, 2025; v1 submitted 26 January, 2025; originally announced January 2025.

arXiv:2501.13260 [pdf]

Field induced density wave in a kagome superconductor

Authors: Md Shafayat Hossain, Qi Zhang, Julian Ingham, Jinjin Liu, Sen Shao, Yangmu Li, Yuxin Wang, Bal K. Pokharel, Zi-Jia Cheng, Yu-Xiao Jiang, Maksim Litskevich, Byunghoon Kim, Xian Yang, Yongkai Li, Tyler A. Cochran, Yugui Yao, Dragana Popović, Zhiwei Wang, Guoqing Chang, Ronny Thomale, Luis Balicas, M. Zahid Hasan

Abstract: On the kagome lattice, electrons benefit from the simultaneous presence of band topology, flat electronic bands, and van Hove singularities, forming competing or cooperating orders. Understanding the interrelation between these distinct order parameters remains a significant challenge, leaving much of the associated physics unexplored. In the kagome superconductor KV3Sb5, which exhibits a charge density wave (CDW) state below T = 78 K, we uncover an unpredicted field-induced phase transition below 6 K. The observed transition is marked by a hysteretic anomaly in the resistivity, nonlinear electrical transport, and a change in the symmetry of the electronic response as probed via the angular dependence of the magnetoresistivity. These observations surprisingly suggest the emergence of an unanticipated broken symmetry state coexisting with the original CDW. To understand this experimental observation, we developed a theoretical minimal model for the normal state inside the high-temperature parent CDW phase where an incommensurate CDW order emerges as an instability sub-leading to superconductivity. The incommensurate CDW emerges when superconducting fluctuations become fully suppressed by large magnetic fields. Our results suggest that, in kagome superconductors, quantum states can either coexist or be nearly degenerate in energy, indicating that these are rich platforms to expose new correlated phenomena.

Submitted 22 January, 2025; originally announced January 2025.

arXiv:2501.09520 [pdf, other]

RWZC: A Model-Driven Approach for Learning-based Robust Wyner-Ziv Coding

Authors: Yuxuan Shi, Shuo Shao, Yongpeng Wu, Wenjun Zhang, Merouane Debbah

Abstract: In this paper, a novel learning-based Wyner-Ziv coding framework is considered under a distributed image transmission scenario, where the correlated source is only available at the receiver. Unlike other learnable frameworks, our approach demonstrates robustness to non-stationary source correlation, where the overlapping information between image pairs varies. Specifically, we first model the affine relationship between correlated images and leverage this model for learnable mask generation and rate-adaptive joint source-channel coding. Moreover, we also provide a warping-prediction network to remove the distortion from channel interference and affine transform. Intuitively, the observed performance improvement is largely due to focusing on the simple geometric relationship, rather than the complex joint distribution between the sources. Numerical results show that our framework achieves a 1.5 dB gain in PSNR and a 0.2 improvement in MS-SSIM, along with a significant superiority in perceptual metrics, compared to state-of-the-art methods when applied to real-world samples with non-stationary correlations.

Submitted 5 February, 2025; v1 submitted 16 January, 2025; originally announced January 2025.

Comments: 14 pages, 17 figures, accepted by IEEE Journal on Selected Areas in Communications

arXiv:2501.09282 [pdf, other]

Diatomic and Polyatomic Heteronuclear Ultralong-Range Rydberg Molecules

Authors: Qing Li, Shi-Yao Shao, Li-Hua Zhang, Bang Liu, Zheng-Yuan Zhang, Jun Zhang, Qi-Feng Wang, Han-Chao Chen, Yu Ma, Tian-Yu Han, Dong-Sheng Ding, Bao-Sen Shi

Abstract: Ultra-long-range Rydberg molecules (ULRMs) have attracted significant interest due to their unique electronic properties and potential applications in quantum technologies. We theoretically investigate the formation and characteristics of heteronuclear ULRMs, focusing on Rb-Cs systems. We explore the vibrational energy levels of heteronuclear nD ULRMs and compare them with homonuclear counterparts. We also predict the formation of polyatomic heteronuclear ULRMs, discussing how the binding energy and spectral features evolve as the number of ground-state atoms increases. Our theoretical predictions are presented in terms of molecular spectra and provide insight into the formation dynamics of these systems. The study further explores the potential applications of heteronuclear ULRMs in quantum information processing, quantum simulation, and precision measurements, offering new avenues for future research in many-body physics and quantum technologies.

Submitted 15 January, 2025; originally announced January 2025.

arXiv:2501.00051

DDD-GenDT: Dynamic Data-driven Generative Digital Twin Framework

Authors: Yu-Zheng Lin, Qinxuan Shi, Zhanglong Yang, Banafsheh Saber Latibari, Sicong Shao, Soheil Salehi, Pratik Satam

Abstract: Digital twin (DT) technology has emerged as a transformative approach to simulate, predict, and optimize the behavior of physical systems, with applications that span manufacturing, healthcare, climate science, and more. However, the development of DT models often faces challenges such as high data requirements, integration complexity, and limited adaptability to dynamic changes in physical systems. This paper presents a new method inspired by dynamic data-driven applications systems (DDDAS), called the dynamic data-driven generative of digital twins framework (DDD-GenDT), which combines the physical system with LLM, allowing LLM to act as DT to interact with the physical system operating status and generate the corresponding physical behaviors. We apply DDD-GenDT to the computer numerical control (CNC) machining process, and we use the spindle current measurement data in the NASA milling wear data set as an example to enable LLMs to forecast the physical behavior from historical data and interact with current observations. Experimental results show that in the zero-shot prediction setting, the LLM-based DT can adapt to the change in the system, and the average RMSE of the GPT-4 prediction is 0.479A, which is 4.79% of the maximum spindle motor current measurement of 10A, with little training data and instructions required. Furthermore, we analyze the performance of DDD-GenDT in this specific application and their potential to construct digital twins. We also discuss the limitations and challenges that may arise in practical implementations.

Submitted 27 December, 2024; originally announced January 2025.

arXiv:2412.18606

doi 10.21468/SciPostPhys.18.4.121

Lattice T-duality from non-invertible symmetries in quantum spin chains

Authors: Salvatore D. Pace, Arkya Chatterjee, Shu-Heng Shao

Abstract: Dualities of quantum field theories are challenging to realize in lattice models of qubits. In this work, we explore one of the simplest dualities, T-duality of the compact boson CFT, and its realization in quantum spin chains. In the special case of the XX model, we uncover an exact lattice T-duality, which is associated with a non-invertible symmetry that exchanges two lattice U(1) symmetries. The latter symmetries flow to the momentum and winding U(1) symmetries with a mixed anomaly in the CFT. However, the charge operators of the two U(1) symmetries do not commute on the lattice and instead generate the Onsager algebra. We discuss how some of the anomalies in the CFT are nonetheless still exactly realized on the lattice and how the lattice U(1) symmetries enforce gaplessness. We further explore lattice deformations preserving both U(1) symmetries and find a rich gapless phase diagram with special $\mathrm{Spin}(2k)_1$ WZW model points and whose phase transitions all have dynamical exponent ${z>1}$.

Submitted 8 April, 2025; v1 submitted 24 December, 2024; originally announced December 2024.

Comments: 45 pages plus appendices. v2: published version

Report number: MIT-CTP/5815, YITP-SB-2024-34

Journal ref: SciPost Phys. 18, 121 (2025)

arXiv:2412.18263

High-Rank Irreducible Cartesian Tensor Decomposition and Bases of Equivariant Spaces

Authors: Shihao Shao, Yikang Li, Zhouchen Lin, Qinghua Cui

Abstract: Irreducible Cartesian tensors (ICTs) play a crucial role in the design of equivariant graph neural networks, as well as in theoretical chemistry and chemical physics. Meanwhile, the design space of available linear operations on tensors that preserve symmetry presents a significant challenge. The ICT decomposition and a basis of this equivariant space are difficult to obtain for high-rank tensors. After decades of research, Bonvicini (2024) recently achieved an explicit ICT decomposition for $n=5$ with factorial time/space complexity. In this work we, for the first time, obtain decomposition matrices for ICTs up to rank $n=9$ with reduced and affordable complexity, by constructing what we call path matrices. The path matrices are obtained by performing chain-like contractions with Clebsch-Gordan matrices following the parentage scheme. We prove and leverage that the concatenation of path matrices is an orthonormal change-of-basis matrix between the Cartesian tensor product space and the spherical direct sum spaces. Furthermore, we identify a complete orthogonal basis for the equivariant space, rather than a spanning set (Pearce-Crump, 2023), through this path-matrix technique. To the best of our knowledge, this is also the first analytic, rather than numerical, method for theoretically obtaining arbitrary-rank orthogonal ICT decomposition matrices and orthogonal equivariant bases. We further extend our result to arbitrary tensor product and direct sum spaces, enabling free design between different spaces while keeping symmetry.
The Python code is available at https://github.com/ShihaoShao-GH/ICT-decomposition-and-equivariant-bases, where the $n=6,\dots,9$ ICT decomposition matrices are obtained in 1s, 3s, 11s, and 4m32s, respectively, on a 28-core Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz.

Submitted 19 March, 2025; v1 submitted 24 December, 2024; originally announced December 2024.

Comments: 48 pages

Showing 1–50 of 489 results for author: Sha, S