Search | arXiv e-print repository

SmartEraser: Remove Anything from Images using Masked-Region Guidance

Authors: Longtao Jiang, Zhendong Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Lei Shi, Dong Chen, Houqiang Li

Abstract: Object removal has so far been dominated by the mask-and-inpaint paradigm, where the masked region is excluded from the input, leaving models relying on unmasked areas to inpaint the missing region. However, this approach lacks contextual information for the masked area, often resulting in unstable performance. In this work, we introduce SmartEraser, built with a new removing paradigm called Masked-Region Guidance. This paradigm retains the masked region in the input, using it as guidance for the removal process. It offers several distinct advantages: (a) it guides the model to accurately identify the object to be removed, preventing its regeneration in the output; (b) since the user mask often extends beyond the object itself, it aids in preserving the surrounding context in the final result. Leveraging this new paradigm, we present Syn4Removal, a large-scale object removal dataset, where instance segmentation data is used to copy and paste objects onto images as removal targets, with the original images serving as ground truths. Experimental results demonstrate that SmartEraser significantly outperforms existing methods, achieving superior performance in object removal, especially in complex scenes with intricate compositions.

Submitted 11 June, 2025; v1 submitted 14 January, 2025; originally announced January 2025.

Comments: Project at: https://longtaojiang.github.io/smarteraser.github.io/

Journal ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025
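
The copy-and-paste construction described for Syn4Removal can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `make_removal_sample`, the grayscale list-of-lists image type, and the `top`/`left` placement arguments are all hypothetical, and details the abstract does not cover (object selection, blending, mask dilation) are left out.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumption: grayscale image stored as rows of pixel values


def make_removal_sample(
    background: Image,
    obj: Image,
    obj_mask: Image,
    top: int,
    left: int,
) -> Tuple[Image, Image, Image]:
    """Paste `obj` onto `background` wherever `obj_mask` is nonzero.

    Returns (model_input, removal_mask, ground_truth). The untouched
    background is the ground truth, as the abstract describes.
    """
    height, width = len(background), len(background[0])
    composite = [row[:] for row in background]      # copy so the original stays intact
    removal_mask = [[0] * width for _ in range(height)]

    for dy, (obj_row, mask_row) in enumerate(zip(obj, obj_mask)):
        for dx, (value, keep) in enumerate(zip(obj_row, mask_row)):
            y, x = top + dy, left + dx
            # Skip background pixels of the object crop and anything out of bounds.
            if keep and 0 <= y < height and 0 <= x < width:
                composite[y][x] = value
                removal_mask[y][x] = 1

    ground_truth = [row[:] for row in background]
    return composite, removal_mask, ground_truth
```

Under Masked-Region Guidance as described above, the model would receive the composite with the pasted object still visible together with the mask, rather than a composite with the masked pixels blanked out.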

arXiv:2501.07579 [pdf]

doi 10.1667/RADE-24-00277.1

Correlation Between DNA Double-Strand Break Distribution in 3D Genome and Radiation-Induced Cell Death

Authors: Ankang Hu, Wanyi Zhou, Xiyu Luo, Rui Qiu, Junli Li

Abstract: The target theory is the most classical hypothesis explaining radiation-induced cell death, yet the physical or biological nature of the "target" remains ambiguous. This study hypothesizes that the distribution of DNA double-strand breaks (DSBs) within the 3D genome is a pivotal factor affecting the probability of radiation-induced cell death. We propose that clustered DSBs in DNA segments with high interaction frequencies are more susceptible to leading to cell death than isolated DSBs. Topologically associating domains (TADs) can be regarded as the reference unit for evaluating the impact of DSB clustering in the 3D genome. To quantify this correlation between the DSB distribution in the 3D genome and radiation-induced effect, we developed a simplified model considering the DSB distribution across TADs. Utilizing track-structure Monte Carlo codes to simulate the electron and carbon ion irradiation, we calculated the incidence of each case across a variety of radiation doses and LETs. Our simulation results indicate that DSBs in TADs with frequent interactions (case 3) are significantly more likely to induce cell death than clustered DSBs within a single TAD (case 2). Moreover, case 2 is significantly more likely to induce cell death than isolated DSBs (case 1). The curves of the incidence of case 2 and case 3 versus LETs have a similar shape to the radiation quality factor used in radiation protection. This indicates that these two cases are also associated with the stochastic effects induced by high LET irradiation.
Our study underscores the significance of the 3D genome structure in the fundamental mechanisms of radiobiological effects. The hypothesis in our research offers novel perspectives on the mechanisms that regulate radiobiological effects. Moreover, it serves as a valuable reference for establishing mechanistic models that can predict cell survival under different doses and LETs.

Submitted 9 June, 2025; v1 submitted 27 December, 2024; originally announced January 2025.

Comments: 19 pages, 6 figures, 1 supplementary document

Journal ref: Radiation Research. 2025, 203(6): 421-432

arXiv:2501.06835 [pdf, other]

X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding

Authors: Wenqi Zhou, Kai Cao, Hao Zheng, Xinyi Zheng, Miao Liu, Per Ola Kristensson, Walterio Mayol-Cuevas, Fan Zhang, Weizhe Lin, Junxiao Shen

Abstract: Long-form egocentric video understanding provides rich contextual information and unique insights into long-term human behaviors, holding significant potential for applications in embodied intelligence, long-term activity analysis, and personalized assistive technologies. However, existing benchmark datasets primarily focus on single, short-duration videos or moderately long videos up to dozens of minutes, leaving a substantial gap in evaluating extensive, ultra-long egocentric video recordings. To address this, we introduce X-LeBench, a novel benchmark dataset specifically crafted for evaluating tasks on extremely long egocentric video recordings. Leveraging the advanced text processing capabilities of large language models (LLMs), X-LeBench develops a life-logging simulation pipeline that produces realistic, coherent daily plans aligned with real-world video data. This approach enables the flexible integration of synthetic daily plans with real-world footage from Ego4D, a massive-scale egocentric video dataset covering a wide range of daily life scenarios, resulting in 432 simulated video life logs that mirror realistic daily activities in contextually rich scenarios. The video life-log durations span from 23 minutes to 16.4 hours. The evaluation of several baseline systems and multimodal large language models (MLLMs) reveals their poor performance across the board, highlighting the inherent challenges of long-form egocentric video understanding and underscoring the need for more advanced models.

Submitted 12 January, 2025; originally announced January 2025.

arXiv:2501.06689 [pdf, other]

TAPO: Task-Referenced Adaptation for Prompt Optimization

Authors: Wenxin Luo, Weirui Wang, Xiaopeng Li, Weibo Zhou, Pengyue Jia, Xiangyu Zhao

Abstract: Prompt engineering can significantly improve the performance of large language models (LLMs), with automated prompt optimization (APO) gaining significant attention due to the time-consuming and laborious nature of manual prompt design. However, much of the existing work in APO overlooks task-specific characteristics, resulting in prompts that lack domain specificity and are not well-suited for task-specific optimization. In this paper, we introduce TAPO, a multitask-aware prompt optimization framework composed of three key modules. First, a task-aware metric selection module is proposed to enhance task-specific prompt generation capabilities. Second, we present a multi-metrics evaluation module to jointly evaluate prompts from multiple perspectives. Third, an evolution-based optimization framework is introduced for automatic prompt refinement, which improves adaptability across various tasks. Extensive experiments on six datasets demonstrate the effectiveness of our approach, and our code is publicly available.

Submitted 26 February, 2025; v1 submitted 11 January, 2025; originally announced January 2025.

Comments: Accepted to ICASSP 2025

arXiv:2501.06645 [pdf, other]

FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

Authors: Tong Liu, Xiao Yu, Wenxuan Zhou, Jindong Gu, Volker Tresp

Abstract: Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach in aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward model, and focus on training it to correct misranked preference pairs. However, recent work (Chen et al., 2024) empirically finds that DPO training rarely improves these misranked preference pairs, despite its gradient emphasizing these cases. We introduce FocalPO, a DPO variant that instead down-weighs misranked preference pairs and prioritizes enhancing the model's understanding of pairs that it can already rank correctly. Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor to dynamically scale DPO loss. Our experiment demonstrates that FocalPO surpasses DPO and its variants on popular benchmarks like Alpaca Eval 2.0 using Mistral-Base-7B and Llama-3-Instruct-8B, with the introduced hyperparameter fixed. Additionally, we empirically reveal how FocalPO affects training on correct and incorrect sample groups, further underscoring its effectiveness.

Submitted 3 June, 2025; v1 submitted 11 January, 2025; originally announced January 2025.

Comments: ACL 2025
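
The idea of scaling the DPO loss with a modulating factor can be sketched as follows. The abstract does not give the exact form of the factor, so the `p ** gamma` term below is an assumption chosen only because it shrinks the loss on misranked pairs. The names `margin`, `beta`, and `gamma` are likewise illustrative: `margin` stands for the chosen-minus-rejected difference of policy-to-reference log-ratios that appears in the standard DPO objective.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function for moderate inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """Standard per-pair DPO loss: -log sigmoid(beta * margin)."""
    return -math.log(sigmoid(beta * margin))


def focal_style_loss(margin: float, beta: float = 0.1, gamma: float = 1.0) -> float:
    """DPO loss scaled by a modulating factor p ** gamma (assumed form).

    p is the implicit probability that the pair is ranked correctly.
    Misranked pairs have p < 0.5, so their loss is scaled down more
    strongly than the loss of correctly ranked pairs.
    """
    p = sigmoid(beta * margin)
    return -(p ** gamma) * math.log(p)
```

With `gamma = 0` the sketch reduces to plain DPO. Note that vision-style Focal Loss uses the factor `(1 - p) ** gamma`, which down-weights easy examples; the sketch uses the opposite direction to match the down-weighting of misranked pairs described in the abstract.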

arXiv:2501.06590 [pdf, other]

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

Authors: Xiangru Tang, Tianyu Hu, Muyang Ye, Yanjun Shao, Xunjian Yin, Siru Ouyang, Wangchunshu Zhou, Pan Lu, Zhuosheng Zhang, Yilun Zhao, Arman Cohan, Mark Gerstein

Abstract: Chemical reasoning usually involves complex, multi-step processes that demand precise calculations, where even minor errors can lead to cascading failures. Furthermore, large language models (LLMs) encounter difficulties handling domain-specific formulas, executing reasoning steps accurately, and integrating code effectively when tackling chemical reasoning tasks. To address these challenges, we present ChemAgent, a novel framework designed to improve the performance of LLMs through a dynamic, self-updating library. This library is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. Then, when presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory, facilitating effective task decomposition and the generation of solutions. Our method designs three types of memory and a library-enhanced reasoning component, enabling LLMs to improve over time through experience. Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our findings suggest substantial potential for future applications, including tasks such as drug discovery and materials science. Our code can be found at https://github.com/gersteinlab/chemagent

Submitted 11 January, 2025; originally announced January 2025.

arXiv:2501.05107 [pdf]

Harnessing the Power of Vibration Motors to Develop Miniature Untethered Robotic Fishes

Authors: Chongjie Jiang, Yingying Dai, Jinyang Le, Xiaomeng Chen, Yu Xie, Wei Zhou, Fuzhou Niu, Ying Li, Tao Luo

Abstract: Miniature underwater robots play a crucial role in the exploration and development of marine resources, particularly in confined spaces and high-pressure deep-sea environments. This study presents the design, optimization, and performance of a miniature robotic fish, powered by the oscillation of bio-inspired fins. These fins feature a rigid-flexible hybrid structure and use an eccentric rotating mass (ERM) vibration motor as the excitation source to generate high-frequency unidirectional oscillations that induce acoustic streaming for propulsion. The drive mechanism, powered by miniature ERM vibration motors, eliminates the need for complex mechanical drive systems, enabling complete isolation of the entire drive system from the external environment and facilitating the miniaturization of the robotic fish. A compact, untethered robotic fish, measuring 85 × 60 × 45 mm^3, is equipped with three bio-inspired fins located at the pectoral and caudal positions. Experimental results demonstrate that the robotic fish achieves a maximum forward swimming speed of 1.36 body lengths (BL) per second when powered by all fins and a minimum turning radius of 0.6 BL when powered by a single fin. These results underscore the significance of employing the ERM vibration motor in advancing the development of highly maneuverable, miniature untethered underwater robots for various marine exploration tasks.

Submitted 9 January, 2025; originally announced January 2025.

Comments: 8 pages, 8 figures

arXiv:2501.04945 [pdf, ps, other]

Step-by-Step Mastery: Enhancing Soft Constraint Following Ability of Large Language Models

Authors: Qingyu Ren, Jie Zeng, Qianyu He, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu

Abstract: It is crucial for large language models (LLMs) to follow instructions that involve multiple constraints. However, it is an unexplored area to enhance LLMs' ability to follow soft constraints. To bridge the gap, we first design a pipeline to construct datasets with high-quality outputs automatically. Additionally, to fully utilize the positive and negative samples generated during the data construction process, we choose Direct Preference Optimization (DPO) as the training method. Furthermore, taking into account the difficulty of soft constraints indicated by the number of constraints, we design a curriculum learning training paradigm based on the constraint quantity. We experimentally evaluate the effectiveness of our methods in improving LLMs' soft constraint following ability and analyze the factors driving the improvements. The datasets and code are publicly available at https://github.com/Rainier-rq/FollowSoftConstraint.

Submitted 31 May, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

arXiv:2501.04907 [pdf, other]

Optical skyrmion lattices accelerating in free space

Authors: Haijun Wu, Weijie Zhou, Zhihan Zhu, Yijie Shen

Abstract: Generation and propagation of optical skyrmions provide a versatile platform for topologically nontrivial optical informatics and light-matter interactions, but their acceleration along curved trajectories remains to be studied. In this study, we experimentally demonstrate the first accelerating skyrmion lattices conveyed by Airy structured light, characterized by topologically stable skyrmion textures with self-acceleration along parabolic trajectories. We show that the skyrmion unit cell can maintain a Skyrme number $|N_\text{sk}|>0.9$ within a propagation range of $\pm1.22\ z_R$ upon parabolic acceleration. Notably, for the meron structure, $|N_\text{sk}|$ remains stable within $0.5\pm0.02$ over a significantly extended range of $\pm3.06\ z_R$. Our work provides a new potential carrier for topologically robust information distribution, particle sorting and manipulation.

Submitted 8 January, 2025; originally announced January 2025.

arXiv:2501.03936 [pdf, other]

PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides

Authors: Hao Zheng, Xinyan Guan, Hao Kong, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun

Abstract: Automatically generating presentations from documents is a challenging task that requires accommodating content quality, visual appeal, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, overlooking visual appeal and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to extract slide-level functional types and content schemas, then drafts an outline and iteratively generates editing actions based on selected reference slides to create new slides. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Results demonstrate that PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.

Submitted 21 February, 2025; v1 submitted 7 January, 2025; originally announced January 2025.

Comments: 8 pages, 23 figures, see https://github.com/icip-cas/PPTAgent for details

arXiv:2501.02732 [pdf, other]

AFed: Algorithmic Fair Federated Learning

Authors: Huiqiang Chen, Tianqing Zhu, Wanlei Zhou, Wei Zhao

Abstract: Federated Learning (FL) has gained significant attention as it facilitates collaborative machine learning among multiple clients without centralizing their data on a server. FL ensures the privacy of participating clients by locally storing their data, which creates new challenges in fairness. Traditional debiasing methods assume centralized access to sensitive information, rendering them impractical for the FL setting. Additionally, FL is more susceptible to fairness issues than centralized machine learning due to the diverse client data sources that may be associated with group information. Therefore, training a fair model in FL without access to client local data is important and challenging. This paper presents AFed, a straightforward yet effective framework for promoting group fairness in FL. The core idea is to circumvent restricted data access by learning the global data distribution. This paper proposes two approaches: AFed-G, which uses a conditional generator trained on the server side, and AFed-GAN, which improves upon AFed-G by training a conditional GAN on the client side. We augment the client data with the generated samples to help remove bias. Our theoretical analysis justifies the proposed methods, and empirical results on multiple real-world datasets demonstrate a substantial improvement in AFed over several baselines.

Submitted 5 January, 2025; originally announced January 2025.

Comments: Accepted by IEEE Transactions on Neural Networks and Learning Systems

arXiv:2412.20833 [pdf, ps, other]

Inclusion 2024 Global Multimedia Deepfake Detection Challenge: Towards Multi-dimensional Face Forgery Detection

Authors: Yi Zhang, Weize Gao, Changtao Miao, Man Luo, Jianshu Li, Wenzhong Deng, Zhe Li, Bingyu Hu, Weibin Yao, Yunfeng Diao, Wenbo Zhou, Tao Gong, Qi Chu

Abstract: In this paper, we present the Global Multimedia Deepfake Detection challenge, held concurrently with Inclusion 2024. Our Multimedia Deepfake Detection challenge aims to detect automatic image and audio-video manipulations including but not limited to editing, synthesis, generation, Photoshop, etc. Our challenge has attracted 1500 teams from all over the world, with about 5000 valid result submissions. We invite the top 20 teams to present their solutions to the challenge, from which the top 3 teams are awarded prizes in the grand finale. In this paper, we present the solutions from the top 3 teams of the two tracks, to boost the research work in the field of image and audio-video forgery detection. The methodologies developed through the challenge will contribute to the development of next-generation deepfake detection systems and we encourage participants to open source their methods.

Submitted 3 June, 2025; v1 submitted 30 December, 2024; originally announced December 2024.

Comments: Inclusion 2024 Global Multimedia Deepfake Detection Competition Top Team Technical Report

arXiv:2412.20413 [pdf, other]

EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers

Authors: Daiheng Gao, Shilin Lu, Shaw Walters, Wenbo Zhou, Jiaming Chu, Jie Zhang, Bang Zhang, Mengxi Jia, Jian Zhao, Zhaoxin Fan, Weiming Zhang

Abstract: Removing unwanted concepts from large-scale text-to-image (T2I) diffusion models while maintaining their overall generative quality remains an open challenge. This difficulty is especially pronounced in emerging paradigms, such as Stable Diffusion (SD) v3 and Flux, which incorporate flow matching and transformer-based architectures. These advancements limit the transferability of existing concept-erasure techniques that were originally designed for the previous T2I paradigm (e.g., SD v1.4). In this work, we introduce EraseAnything, the first method specifically developed to address concept erasure within the latest flow-based T2I framework. We formulate concept erasure as a bi-level optimization problem, employing LoRA-based parameter tuning and an attention map regularizer to selectively suppress undesirable activations. Furthermore, we propose a self-contrastive learning strategy to ensure that removing unwanted concepts does not inadvertently harm performance on unrelated ones. Experimental results demonstrate that EraseAnything successfully fills the research gap left by earlier methods in this new T2I paradigm, achieving state-of-the-art performance across a wide range of concept erasure tasks.

Submitted 2 January, 2025; v1 submitted 29 December, 2024; originally announced December 2024.

Comments: 24 pages, 18 figures

arXiv:2412.20145 [pdf, other]

Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering

Authors: Wei Zhou, Mohsen Mesgar, Annemarie Friedrich, Heike Adel

Abstract: Complex table question answering (TQA) aims to answer questions that require complex reasoning, such as multi-step or multi-category reasoning, over data represented in tabular form. Previous approaches demonstrated notable performance by leveraging either closed-source large language models (LLMs) or fine-tuned open-weight LLMs. However, fine-tuning LLMs requires high-quality training data, which is costly to obtain, and utilizing closed-source LLMs poses accessibility challenges and leads to reproducibility issues. In this paper, we propose Multi-Agent Collaboration with Tool use (MACT), a framework that requires neither closed-source models nor fine-tuning. In MACT, a planning agent and a coding agent that also make use of tools collaborate to answer questions. Our experiments on four TQA benchmarks show that MACT outperforms previous SoTA systems on three out of four benchmarks and that it performs comparably to the larger and more expensive closed-source model GPT-4 on two benchmarks, even when using only open-weight models without any fine-tuning. We conduct extensive analyses to prove the effectiveness of MACT's multi-agent collaboration in TQA.

Submitted 8 February, 2025; v1 submitted 28 December, 2024; originally announced December 2024.

Comments: Accepted at NAACL 2025 Findings

arXiv:2412.18933 [pdf, other]

TINQ: Temporal Inconsistency Guided Blind Video Quality Assessment

Authors: Yixiao Li, Xiaoyuan Yang, Weide Liu, Xin Jin, Xu Jia, Yukun Lai, Haotao Liu, Paul L Rosin, Wei Zhou

Abstract: Blind video quality assessment (BVQA) has been actively researched for user-generated content (UGC) videos. Recently, super-resolution (SR) techniques have been widely applied in UGC. Therefore, an effective BVQA method for both UGC and SR scenarios is essential. Temporal inconsistency, referring to irregularities between consecutive frames, is relevant to video quality. Current BVQA approaches typically model temporal relationships in UGC videos using statistics of motion information, but inconsistencies remain unexplored. Additionally, different from temporal inconsistency in UGC videos, such inconsistency in SR videos is amplified due to upscaling algorithms. In this paper, we introduce the Temporal Inconsistency Guided Blind Video Quality Assessment (TINQ) metric, demonstrating that exploring temporal inconsistency is crucial for effective BVQA. Since temporal inconsistencies vary between UGC and SR videos, they are calculated in different ways. Based on this, a spatial module highlights inconsistent areas across consecutive frames at coarse and fine granularities. In addition, a temporal module aggregates features over time in two stages. The first stage employs a visual memory capacity block to adaptively segment the time dimension based on estimated complexity, while the second stage focuses on selecting key features. The stages work together through Consistency-aware Fusion Units to regress cross-time-scale video quality.
Extensive experiments on UGC and SR video quality datasets show that our method outperforms existing state-of-the-art BVQA methods. Code is available at https://github.com/Lighting-YXLI/TINQ.

Submitted 25 December, 2024; originally announced December 2024.

arXiv:2412.18895 [pdf, other]

Effects of chiral symmetry restoration on dilepton production in heavy ion collisions

Authors: Wen-Hao Zhou, Che Ming Ko, Kai-Jia Sun

Abstract: Because of their weak interactions with the strongly interacting matter produced in relativistic heavy-ion collisions, dileptons provide an ideal probe of the early dynamics of these collisions. Here, we study dilepton production using a partonic transport model that is based on an extended Nambu-Jona-Lasinio (NJL) model. In this model, the in-medium quark masses decrease with increasing temperature as a result of the restoration of chiral symmetry. We find that the extracted temperature from dileptons of intermediate masses agrees well with the temperature of the partonic matter, suggesting that dilepton production can be used as a thermometer for the produced partonic matter. Our results also indicate that the extracted in-medium quark masses decrease with increasing dilepton temperature, implying that dilepton production can further serve as a probe of chiral symmetry restoration in high energy heavy-ion collisions.

Submitted 25 December, 2024; originally announced December 2024.

Comments: 8 pages, 9 figures

arXiv:2412.18132 [pdf, ps, other]

On Tiling and Spectral Sets in $\mathbb Z_{p^2}\times\mathbb Z_{p^2}$

Authors: Weiqi Zhou

Abstract: Let $p$ be a prime number. It is shown that tiling and spectral sets coincide in $\mathbb Z_{p^2}\times\mathbb Z_{p^2}$ by considering equivalently symplectic spectral pairs. Symplectic structures appear naturally in time-frequency analysis and provide a perspective to reveal patterns that may not be so evident in the Euclidean setting. The main approach here is, however, still to count the size of the zero set and analyze its contents. Some auxiliary results concerning tiling sets and spectral sets of sizes $p$ and $p^{2m-1}$ in $\mathbb Z_{p^m}\times\mathbb Z_{p^m}$ are also presented.

Submitted 22 February, 2025; v1 submitted 23 December, 2024; originally announced December 2024.

Comments: Expository improvements: The proof of the main theorem is now focused on those difficult cases, other simpler cases are moved to earlier sections. More details and helpful comments are added at various places

MSC Class: 42A99; 05B45
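
For readers unfamiliar with the two notions compared in the abstract above, the following is a minimal brute-force sketch of what "tiling pair" and "spectral pair" mean in $\mathbb Z_n\times\mathbb Z_n$. It is not taken from the paper: the function names are hypothetical, and it uses the ordinary Euclidean pairing $\langle a,s\rangle = a_1s_1+a_2s_2$ rather than the symplectic form the paper works with.

```python
import cmath
from itertools import product

def is_tiling_pair(A, B, n):
    """True if every element of Z_n x Z_n equals a + b for exactly one (a, b)."""
    sums = [((a[0] + b[0]) % n, (a[1] + b[1]) % n) for a in A for b in B]
    return len(sums) == n * n and len(set(sums)) == n * n

def is_spectral_pair(A, S, n, tol=1e-9):
    """True if |A| == |S| and the characters indexed by S are orthogonal on A."""
    if len(A) != len(S):
        return False
    for s, t in product(S, repeat=2):
        if s == t:
            continue
        d = ((s[0] - t[0]) % n, (s[1] - t[1]) % n)
        # Inner product of the two characters, restricted to A.
        total = sum(
            cmath.exp(2j * cmath.pi * (a[0] * d[0] + a[1] * d[1]) / n) for a in A
        )
        if abs(total) > tol:
            return False
    return True

n = 4                             # p^2 with p = 2
A = [(x, 0) for x in range(n)]    # the subgroup Z_4 x {0}
B = [(0, y) for y in range(n)]    # a tiling complement of A
```

Here `is_tiling_pair(A, B, n)` holds, `A` is spectral with spectrum `A` itself, and `B` is not a spectrum for `A` under this pairing. The check is exhaustive and only practical for small `n`.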

arXiv:2412.17724 [pdf]

Comprehensive Optimization of Interferometric Diffusing Wave Spectroscopy (iDWS)

Authors: Mingjun Zhao, Leah Dickstein, Akshay S. Nadig, Wenjun Zhou, Santosh Aparanji, Hector Garcia Estrada, Shing-Jiuan Liu, Ting Zhou, Weijian Yang, Aaron Lord, Vivek J. Srinivasan

Abstract: It has been shown that light speckle fluctuations provide a means for noninvasive measurements of cerebral blood flow index (CBFi). While conventional Diffuse Correlation Spectroscopy (DCS) provides marginal brain sensitivity for CBFi in adult humans, new techniques have recently emerged to improve diffuse light throughput and thus, brain sensitivity. Here we further optimize one such approach, interferometric diffusing wave spectroscopy (iDWS), with respect to number of independent channels, camera duty cycle and full well capacity, incident power, noise and artifact mitigation, and data processing. We build the system on a cart and define conditions for stable operation. We show pulsatile CBFi monitoring at 4-4.5 cm source-collector separation in adults with moderate pigmentation (Fitzpatrick 4). We also report preliminary clinical measurements in the Neuro Intensive Care Unit (Neuro ICU). These results push the boundaries of iDWS CBFi monitoring performance beyond previous reports.

Submitted 23 December, 2024; originally announced December 2024.

Comments: 12 pages, 15 figures, 4 tables

arXiv:2412.17632 [pdf, other]

D-Judge: How Far Are We? Evaluating the Discrepancies Between AI-synthesized Images and Natural Images through Multimodal Guidance

Authors: Renyang Liu, Ziyu Lyu, Wei Zhou, See-Kiong Ng

Abstract: In Artificial Intelligence Generated Content (AIGC), distinguishing AI-synthesized images from natural ones remains a key challenge. Despite advancements in generative models, significant discrepancies persist. To systematically investigate and quantify these discrepancies, we introduce an AI-Natural Image Discrepancy assessing benchmark (\textit{D-Judge}) aimed at addressing the critical question: \textit{how far are AI-generated images (AIGIs) from truly realistic images?} We construct \textit{D-ANI}, a dataset with 5,000 natural images and over 440,000 AIGIs generated by nine models using Text-to-Image (T2I), Image-to-Image (I2I), and Text and Image-to-Image (TI2I) prompts. Our framework evaluates the discrepancy across five dimensions: naive image quality, semantic alignment, aesthetic appeal, downstream applicability, and human validation. Results reveal notable gaps, emphasizing the importance of aligning metrics with human judgment. Source code and datasets are available at https://shorturl.at/l83W2.

Submitted 29 March, 2025; v1 submitted 23 December, 2024; originally announced December 2024.

arXiv:2412.16865 [pdf, ps, other]

Mutual Annihilation of Tiles

Authors: Weiqi Zhou

Abstract: Given $A\subset\mathbb Z_n^2$, the purpose of this article is to investigate when the difference set $\Delta A$ is disjoint from the zero set of the Fourier transform of $A$. In the study of tiles in $\mathbb Z_n^2$, the author observed an interesting phenomenon that if $(A,B)$ is a tiling pair with $|A|=|B|$, then sometimes $(A,B)$ is also a spectral pair and vice versa. Moreover, in such cases one of the components would actually have universality (i.e., it is the universal spectrum/tiling complement for its tiling complements/spectra). It turns out that the disjointness is the critical property here, and it shall be analyzed using the symplectic Fourier transform. Under such configuration it is shown that the phenomenon persists either (1) if $A$ is an order $n$ subgroup, or (2) if $n$ is a prime number and $A$ complements an order $n$ subgroup, or (3) if $n=p^2$, and $A$ complements the non-cyclic order $n$ subgroup. A side result binding number of elements in different subgroups is also given.

Submitted 26 March, 2025; v1 submitted 22 December, 2024; originally announced December 2024.

Comments: Added an example at the end to illustrate the main theorem, removed the computation for n=p case (trivial)

MSC Class: 42A99; 05B45

arXiv:2412.16822 [pdf, other]

Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers

Authors: Haoran You, Connelly Barnes, Yuqian Zhou, Yan Kang, Zhenbang Du, Wei Zhou, Lingzhi Zhang, Yotam Nitzan, Xiaoyang Liu, Zhe Lin, Eli Shechtman, Sohrab Amirghodsi, Yingyan Celine Lin

Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency, making them difficult to deploy on resource-constrained devices. One major efficiency bottleneck is that existing DiTs apply equal computation across all regions of an image. However, not all image tokens are equally important, and certain localized areas require more computation, such as objects. To address this, we propose DiffCR, a dynamic DiT inference framework with differentiable compression ratios, which automatically learns to dynamically route computation across layers and timesteps for each image token, resulting in efficient DiTs. Specifically, DiffCR integrates three features: (1) A token-level routing scheme where each DiT layer includes a router that is fine-tuned jointly with model weights to predict token importance scores. In this way, unimportant tokens bypass the entire layer's computation; (2) A layer-wise differentiable ratio mechanism where different DiT layers automatically learn varying compression ratios from a zero initialization, resulting in large compression ratios in redundant layers while others remain less compressed or even uncompressed; (3) A timestep-wise differentiable ratio mechanism where each denoising timestep learns its own compression ratio. The resulting pattern shows higher ratios for noisier timesteps and lower ratios as the image becomes clearer. Extensive experiments on text-to-image and inpainting tasks show that DiffCR effectively captures dynamism across token, layer, and timestep axes, achieving superior trade-offs between generation quality and efficiency compared to prior works. The project website is available at https://www.haoranyou.com/diffcr.

Submitted 27 March, 2025; v1 submitted 21 December, 2024; originally announced December 2024.

Comments: Accepted by CVPR 2025
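
The token-level routing idea in the DiffCR abstract above (low-importance tokens bypass a layer) can be illustrated with a small sketch. This is not the authors' implementation: `route_tokens`, its parameters, and the toy scalar "tokens" are hypothetical, and the hard top-k selection used here is a non-differentiable stand-in for the learned, differentiable ratios the abstract describes.

```python
def route_tokens(tokens, scores, compression_ratio, layer_fn):
    """Apply layer_fn only to the highest-scoring tokens; the rest pass through.

    compression_ratio is the fraction of tokens that skip the layer
    (0.0 means every token is processed).
    """
    n_skip = int(compression_ratio * len(tokens))
    n_keep = len(tokens) - n_skip
    # Rank token indices by importance score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    # Kept tokens go through the layer; skipped tokens are returned unchanged.
    return [layer_fn(t) if i in keep else t for i, t in enumerate(tokens)]

tokens = [1.0, 2.0, 3.0, 4.0]   # toy scalar stand-ins for token embeddings
scores = [0.9, 0.1, 0.8, 0.2]   # hypothetical router outputs
routed = route_tokens(tokens, scores, 0.5, lambda t: t * 10)
```

With a compression ratio of 0.5, only the two tokens with the highest scores are transformed, so the layer does half the work on this toy input.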

arXiv:2412.16720 [pdf, other]

OpenAI o1 System Card

Authors: OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, et al. (238 additional authors not shown)

Abstract: The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.

Submitted 21 December, 2024; originally announced December 2024.

arXiv:2412.15957 [pdf, other]

From General to Specific: Tailoring Large Language Models for Personalized Healthcare

Authors: Ruize Shi, Hong Huang, Wei Zhou, Kehan Yin, Kai Zhao, Yun Zhao

Abstract: The rapid development of large language models (LLMs) has transformed many industries, including healthcare. However, previous medical LLMs have largely focused on leveraging general medical knowledge to provide responses, without accounting for patient variability and lacking true personalization at the individual level. To address this, we propose a novel method called personalized medical language model (PMLM), which explores and optimizes personalized LLMs through recommendation systems and reinforcement learning (RL). Specifically, by utilizing self-informed and peer-informed personalization, PMLM captures changes in behaviors and preferences to design initial personalized prompts tailored to individual needs. We further refine these initial personalized prompts through RL, ultimately enhancing the precision of LLM guidance. Notably, the personalized prompts are hard prompts, which grants PMLM high adaptability and reusability, allowing it to directly leverage high-quality proprietary LLMs. We evaluate PMLM using real-world obstetrics and gynecology data, and the experimental results demonstrate that PMLM achieves personalized responses and provides more refined and individualized services, offering a potential way forward for personalized medical LLMs.

Submitted 20 December, 2024; originally announced December 2024.

arXiv:2412.15738 [pdf, other]

Risk spillovers between the BRICS and the U.S. staple grain futures markets

Authors: Ying-Hui Shao, Yan-Hong Yang, Wei-Xing Zhou

Abstract: This study examines contemporaneous and lagged spillover effects in BRICS staple grain futures markets and their linkages with U.S. markets. The results show that contemporaneous spillovers dominate, while net spillovers are driven by lagged connectedness. Systemic risk is lower in intra-BRICS markets compared to those including the U.S., highlighting the U.S. grain market's significant influence. Brazilian and U.S. grains are key net spillover contributors, excluding U.S. rice, while South African staple grains act as major net receivers. Particularly, the spillover between soybeans is the strongest. The study also reveals heterogeneous impacts of the Russia-Ukraine conflict and Black Sea Grain Initiative on grain futures.

Submitted 25 December, 2024; v1 submitted 20 December, 2024; originally announced December 2024.

Comments: 22 pages, 11 figures

arXiv:2412.14528 [pdf, other]

Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Authors: Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li

Abstract: Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce the Multi-Level Optimal Transport (MultiLevelOT), a novel approach that advances the optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently capture complex distribution structures of logits via the Sinkhorn distance, which approximates the Wasserstein distance for divergence measures. Extensive experiments on tasks such as extractive QA, generative QA, and summarization demonstrate that the MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under various settings. Our approach is robust to different student and teacher models across model families, architectures, and parameter sizes. Codes and models are available at https://github.com/2018cx/Multi-Level-OT.

Submitted 18 January, 2025; v1 submitted 18 December, 2024; originally announced December 2024.

Comments: Accepted by AAAI 2025 (Oral)
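
The MultiLevelOT abstract above mentions the Sinkhorn distance as an approximation to the Wasserstein distance. The sketch below shows the standard Sinkhorn iteration for entropy-regularized optimal transport between two small discrete distributions. It is a generic textbook version, not the authors' code: the function name, the regularization strength `eps`, and the toy cost matrix are assumptions chosen for illustration.

```python
import math

def sinkhorn_cost(a, b, cost, eps=0.1, iters=200):
    """Entropy-regularized OT between histograms a and b.

    Returns the transport cost under the approximate plan, and the plan itself.
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan

a = [0.5, 0.5]
b = [0.5, 0.5]
cost = [[0.0, 1.0], [1.0, 0.0]]   # moving mass between the two bins costs 1
total, plan = sinkhorn_cost(a, b, cost)
```

Because the two histograms are identical, almost all mass stays in place and `total` is close to zero; smaller `eps` brings the result closer to the unregularized Wasserstein cost, at the price of slower and less stable iterations.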

arXiv:2412.13551 [pdf, other]

Large Language Model Federated Learning with Blockchain and Unlearning for Cross-Organizational Collaboration

Authors: Xuhan Zuo, Minghao Wang, Tianqing Zhu, Shui Yu, Wanlei Zhou

Abstract: Large language models (LLMs) have transformed the way computers understand and process human language, but using them effectively across different organizations remains difficult. When organizations work together to improve LLMs, they face several main challenges. First, organizations hesitate to share their valuable data with others. Second, competition between organizations creates trust problems during collaboration. Third, new privacy laws require organizations to be able to delete specific data when requested, which is especially difficult when multiple organizations are learning from shared data. Traditional federated learning approaches do not address these interconnected challenges, particularly in scenarios where participants cannot fully trust each other or the central aggregator. To overcome these limitations, we propose a hybrid blockchain-based federated learning framework that uniquely combines public and private blockchain architectures with multi-agent reinforcement learning. Our framework enables transparent sharing of model updates through the public blockchain while protecting sensitive computations in private chains. Each organization operates as an intelligent agent, using Q-learning to optimize its participation strategy and resource allocation, thus aligning individual incentives with collective goals. Notably, we introduce an efficient unlearning mechanism based on Low-Rank Adaptation (LoRA) that enables selective removal of specific data contributions without compromising the model's overall performance. Through extensive experimentation on real-world datasets, we demonstrate that our framework effectively balances privacy protection, trust establishment, and regulatory compliance while maintaining high model performance.

Submitted 18 December, 2024; originally announced December 2024.

arXiv:2412.13457 [pdf]

doi 10.1103/PhysRevLett.133.256601

Mass Acquisition of Dirac Fermions in Bi4I4 by Spontaneous Symmetry Breaking

Authors: Ming Yang, Wenxuan Zhao, Dan Mu, Zhijian Shi, Jingyuan Zhong, Yaqi Li, Yundan Liu, Jianxin Zhong, Ningyan Cheng, Wei Zhou, Jianfeng Wang, Yan Shi, Ying Sun, Weichang Hao, Lexian Yang, Jincheng Zhuang, Yi Du

Abstract: Massive Dirac fermions, which are essential for realizing novel topological phenomena, are expected to be generated from massless Dirac fermions by breaking the related symmetry, such as time-reversal symmetry (TRS) in topological insulators or crystal symmetry in topological crystalline insulators. Here, we report scanning tunneling microscopy and angle-resolved photoemission spectroscopy studies of α-Bi4I4, which reveal the realization of massive Dirac fermions in the (100) surface states without breaking the TRS. Combined with first-principles calculations, our experimental results indicate that the spontaneous symmetry breaking engenders two nondegenerate edge states at the opposite sides of monolayer Bi4I4 after the structural phase transition, imparting mass to the Dirac fermions after taking the interlayer coupling into account. Our results not only demonstrate the formation of the massive Dirac fermions by spontaneous symmetry breaking, but also imply the potential for the engineering of Dirac fermions for device applications.

Submitted 17 December, 2024; originally announced December 2024.

Journal ref: Physical Review Letters 133, 256601 (2024)

arXiv:2412.13420 [pdf, other]

BotSim: LLM-Powered Malicious Social Botnet Simulation

Authors: Boyu Qiao, Kun Li, Wei Zhou, Shilong Li, Qianqian Lu, Songlin Hu

Abstract: Social media platforms like X (Twitter) and Reddit are vital to global communication. However, advancements in Large Language Model (LLM) technology give rise to social media bots with unprecedented intelligence. These bots adeptly simulate human profiles, conversations, and interactions, disseminating large amounts of false information and posing significant challenges to platform regulation. To better understand and counter these threats, we innovatively design BotSim, a malicious social botnet simulation powered by LLM. BotSim mimics the information dissemination patterns of real-world social networks, creating a virtual environment composed of intelligent agent bots and real human users. In the temporal simulation constructed by BotSim, these advanced agent bots autonomously engage in social interactions such as posting and commenting, effectively modeling scenarios of information flow and user interaction. Building on the BotSim framework, we construct a highly human-like, LLM-driven bot dataset called BotSim-24 and benchmark multiple bot detection strategies against it. The experimental results indicate that detection methods effective on traditional bot datasets perform worse on BotSim-24, highlighting the urgent need for new detection strategies to address the cybersecurity threats posed by these advanced bots.

Submitted 17 December, 2024; originally announced December 2024.

arXiv:2412.13103 [pdf, other]

AI PERSONA: Towards Life-long Personalization of LLMs

Authors: Tiannan Wang, Meiling Tao, Ruoyu Fang, Huilin Wang, Shuai Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou

Abstract: In this work, we introduce the task of life-long personalization of large language models. While recent mainstream efforts in the LLM community mainly focus on scaling data and compute for improved capabilities of LLMs, we argue that it is also very important to enable LLM systems, or language agents, to continuously adapt to the diverse and ever-changing profiles of every distinct user and provide up-to-date personalized assistance. We provide a clear task formulation and introduce a simple, general, effective, and scalable framework for life-long personalization of LLM systems and language agents. To facilitate future research on LLM personalization, we also introduce methods to synthesize realistic benchmarks and robust evaluation metrics. We will release all codes and data for building and benchmarking life-long personalized LLM systems.

Submitted 17 December, 2024; originally announced December 2024.

Comments: Work in progress

arXiv:2412.12888 [pdf, other]

ArtAug: Enhancing Text-to-Image Generation through Synthesis-Understanding Interaction

Authors: Zhongjie Duan, Qianyi Zhao, Cen Chen, Daoyuan Chen, Wenmeng Zhou, Yaliang Li, Yingda Chen

Abstract: The emergence of diffusion models has significantly advanced image synthesis. Recent studies of model interaction and self-corrective reasoning approaches in large language models offer new insights for enhancing text-to-image models. Inspired by these studies, we propose a novel method called ArtAug for enhancing text-to-image models in this paper. To the best of our knowledge, ArtAug is the first one that improves image synthesis models via model interactions with understanding models. In the interactions, we leverage human preferences implicitly learned by image understanding models to provide fine-grained suggestions for image synthesis models. The interactions can modify the image content to make it aesthetically pleasing, such as adjusting exposure, changing shooting angles, and adding atmospheric effects. The enhancements brought by the interaction are iteratively fused into the synthesis model itself through an additional enhancement module. This enables the synthesis model to directly produce aesthetically pleasing images without any extra computational cost. In the experiments, we train the ArtAug enhancement module on existing text-to-image models. Various evaluation metrics consistently demonstrate that ArtAug enhances the generative capabilities of text-to-image models without incurring additional computational costs. The source code and models will be released publicly.

Submitted 18 December, 2024; v1 submitted 17 December, 2024; originally announced December 2024.

Comments: 18 pages, 8 figures

arXiv:2412.12839 [pdf, other]

From An LLM Swarm To A PDDL-Empowered HIVE: Planning Self-Executed Instructions In A Multi-Modal Jungle

Authors: Kaustubh Vyas, Damien Graux, Yijun Yang, Sébastien Montella, Chenxin Diao, Wendi Zhou, Pavlos Vougiouklis, Ruofei Lai, Yang Ren, Keshuang Li, Jeff Z. Pan

Abstract: In response to the call for agent-based solutions that leverage the ever-increasing capabilities of the deep models' ecosystem, we introduce Hive -- a comprehensive solution for selecting appropriate models and subsequently planning a set of atomic actions to satisfy the end-users' instructions. Hive operates over sets of models and, upon receiving natural language instructions (i.e. user queries), schedules and executes explainable plans of atomic actions. These actions can involve one or more of the available models to achieve the overall task, while respecting end-users' specific constraints. Notably, Hive handles tasks that involve multi-modal inputs and outputs, enabling it to handle complex, real-world queries. Our system is capable of planning complex chains of actions while guaranteeing explainability, using an LLM-based formal logic backbone empowered by PDDL operations. We introduce the MuSE benchmark in order to offer a comprehensive evaluation of the multi-modal capabilities of agent systems. Our findings show that our framework redefines the state-of-the-art for task selection, outperforming other competing systems that plan operations across multiple models, while offering transparency guarantees and fully adhering to user constraints.

Submitted 17 December, 2024; originally announced December 2024.

Comments: Under review

arXiv:2412.11476 [pdf, other]

Vertical Federated Unlearning via Backdoor Certification

Authors: Mengde Han, Tianqing Zhu, Lefeng Zhang, Huan Huo, Wanlei Zhou

Abstract: Vertical Federated Learning (VFL) offers a novel paradigm in machine learning, enabling distinct entities to train models cooperatively while maintaining data privacy. This method is particularly pertinent when entities possess datasets with identical sample identifiers but diverse attributes. Recent privacy regulations emphasize an individual's \emph{right to be forgotten}, which necessitates the ability for models to unlearn specific training data. The primary challenge is to develop a mechanism to eliminate the influence of a specific client from a model without erasing all relevant data from other clients. Our research investigates the removal of a single client's contribution within the VFL framework. We introduce an innovative modification to traditional VFL by employing a mechanism that inverts the typical learning trajectory with the objective of extracting specific data contributions. This approach seeks to optimize model performance using gradient ascent, guided by a pre-defined constrained model. We also introduce a backdoor mechanism to verify the effectiveness of the unlearning procedure. Our method avoids fully accessing the initial training data and avoids storing parameter updates. Empirical evidence shows that the results align closely with those achieved by retraining from scratch. Utilizing gradient ascent, our unlearning approach addresses key challenges in VFL, laying the groundwork for future advancements in this domain. All the code and implementations related to this paper are publicly available at https://github.com/mengde-han/VFL-unlearn.

Submitted 16 December, 2024; originally announced December 2024.
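
The abstract above describes unlearning by gradient ascent constrained by a reference model. The toy sketch below illustrates only that general idea on a one-parameter least-squares model; it is not the paper's VFL procedure. The names `unlearn`, `w_ref`, and `radius`, the clipping-based projection, and the data are all hypothetical choices made for illustration.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def mse_grad(w, data):
    """Derivative of mse with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def unlearn(w, forget, w_ref, radius, lr=0.05, steps=20):
    """Gradient ASCENT on the forget set, kept within `radius` of w_ref."""
    for _ in range(steps):
        w = w + lr * mse_grad(w, forget)                   # step that raises the loss
        w = min(max(w, w_ref - radius), w_ref + radius)    # project back into the ball
    return w

forget = [(1.0, 2.0), (2.0, 4.0)]   # data to be forgotten (follows y = 2x)
w_trained = 1.9                     # hypothetical weight fitted with that data
w_ref = 1.5                         # hypothetical reference (constraint) model
w_new = unlearn(w_trained, forget, w_ref, radius=1.0)
```

After the call, the loss on the forget set is higher than before while the weight stays within the allowed distance of the reference model, which is the role the constraint plays in keeping the ascent from diverging.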

arXiv:2412.11417 [pdf, other]

RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement

Authors: Junjie Lin, Jian Zhao, Lin Liu, Yue Deng, Youpeng Zhao, Lanxiao Huang, Xia Lin, Wengang Zhou, Houqiang Li

Abstract: Traditionally, AI development for two-player zero-sum games has relied on two primary techniques: decision trees and reinforcement learning (RL). A common approach involves using a fixed decision tree as one player's strategy while training an RL agent as the opponent to identify vulnerabilities in the decision tree, thereby improving its strategic strength iteratively. However, this process often requires significant human intervention to refine the decision tree after identifying its weaknesses, resulting in inefficiencies and hindering full automation of the strategy enhancement process. Fortunately, the advent of Large Language Models (LLMs) offers a transformative opportunity to automate the process. We propose RL-LLM-DT, an automatic decision tree generation method based on RL Evaluation and LLM Enhancement. Given an initial decision tree, the method involves two important iterative steps. Response Policy Search: RL is used to discover counter-strategies targeting the decision tree. Policy Improvement: LLMs analyze failure scenarios and generate improved decision tree code. In our method, RL focuses on finding the decision tree's flaws while LLM is prompted to generate an improved version of the decision tree. The iterative refinement process terminates when RL can't find any flaw of the tree or LLM fails to improve the tree. To evaluate the effectiveness of this integrated approach, we conducted experiments in a curling game.
After iterative refinements, our curling AI based on the decision tree ranks first on the Jidi platform among 34 curling AIs in total, which demonstrates that LLMs can significantly enhance the robustness and adaptability of decision trees, representing a substantial advancement in the field of Game AI. Our code is available at https://github.com/Linjunjie99/RL-LLM-DT.

Submitted 16 December, 2024; v1 submitted 15 December, 2024; originally announced December 2024.

Comments: Length: 10 pages. Figures: 10 figures. Additional Notes: In this paper, we have introduced a novel hybrid approach which leverages the strengths of both RL and LLMs to iteratively refine decision tree tactics, enhancing their performance and adaptability

MSC Class: 68T05 ACM Class: I.2.6; I.2.11
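
The two iterative steps named in the abstract (Response Policy Search, then Policy Improvement) and its stopping rule can be sketched as a simple loop. This is only an illustrative sketch, not the authors' implementation: the function names `refine`, `toy_find_flaw`, and `toy_improve`, the set-based tree representation, and the `max_rounds` cap are all hypothetical stand-ins for the RL search and the LLM call.

```python
from typing import Callable, FrozenSet, Optional

# Hypothetical toy representation: a "tree" is the set of situations it handles.
Tree = FrozenSet[str]


def refine(
    tree: Tree,
    find_flaw: Callable[[Tree], Optional[str]],
    improve: Callable[[Tree, str], Optional[Tree]],
    max_rounds: int = 10,  # assumed safety cap, not from the abstract
) -> Tree:
    """Alternate flaw search and improvement until neither makes progress."""
    for _ in range(max_rounds):
        flaw = find_flaw(tree)           # stands in for Response Policy Search (RL)
        if flaw is None:                 # no flaw found: stop
            break
        candidate = improve(tree, flaw)  # stands in for Policy Improvement (LLM)
        if candidate is None or candidate == tree:  # no improvement: stop
            break
        tree = candidate
    return tree


# Toy stand-ins so the sketch runs end to end.
SITUATIONS = ("draw", "guard", "takeout")


def toy_find_flaw(tree: Tree) -> Optional[str]:
    """Return the first situation the tree does not handle, if any."""
    for situation in SITUATIONS:
        if situation not in tree:
            return situation
    return None


def toy_improve(tree: Tree, flaw: str) -> Optional[Tree]:
    """Return a new tree that also handles the reported flaw."""
    return tree | {flaw}


final_tree = refine(frozenset({"draw"}), toy_find_flaw, toy_improve)
```

In this toy setting the loop ends once `toy_find_flaw` returns `None`, mirroring the first of the two termination conditions in the abstract.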

arXiv:2412.09130 [pdf, other]

Exploring the chiral magnetic effect in isobar collisions through Chiral Anomaly Transport

Authors: Zilin Yuan, Anping Huang, Guannan Xie, Wen-Hao Zhou, Guo-Liang Ma, Mei Huang

Abstract: We investigate the signal of the chiral magnetic effect (CME) in Au+Au collisions and isobar collisions of $^{96}_{44}\text{Ru}+{}^{96}_{44}\text{Ru}$ and $^{96}_{40}\text{Zr}+{}^{96}_{40}\text{Zr}$ using the newly developed chiral anomaly transport (CAT) module, which is based on the state-of-the-art a multiphase transport (AMPT) model. Our numerical simulation results for the ratio charge correlation $Δγ$ in Ru+Ru and Zr+Zr collisions are close to the latest experimental data. The simulation shows that the CME signal is larger in Ru+Ru collisions than in Zr+Zr collisions, while the background is smaller, and the upper limit of the CME signal is 15% in the isobar collisions.

Submitted 12 December, 2024; originally announced December 2024.

Comments: 13 pages, 15 figures

arXiv:2412.08898 [pdf, ps, other]

Updated version "Robust Voltage Regulation of DC-DC Buck Converter With ZIP Load via An Energy Shaping Control Approach"

Authors: Wei He, Yanqin Zhang, Yukai Shang, Mohammad Masoud Namazi, Wangping Zhou, Josep M. Guerrero

Abstract: ZIP loads (the parallel combination of constant impedance loads, constant current loads and constant power loads) exist widely in power systems. In order to stabilize a buck-converter-based DC distributed system with a ZIP load, an adaptive energy shaping controller (AESC) is devised in this paper. Firstly, based on the assumption that lumped disturbances are known, a full information controller is designed in the framework of the port Hamiltonian system via the energy shaping technique. Besides, using a mathematical deductive method, an estimate of the domain of attraction is given to ensure strict stability. Furthermore, to eliminate the influence of parameter perturbations on the system, a disturbance observer is proposed to reconstruct the lumped disturbances, and the estimated terms are then introduced into the above controller to form an AESC scheme. In addition, the stability analysis of the closed-loop system is given. Lastly, simulation and experimental results are presented to assess the designed controller.

Submitted 11 December, 2024; originally announced December 2024.

arXiv:2412.08082 [pdf, other]

FaceTracer: Unveiling Source Identities from Swapped Face Images and Videos for Fraud Prevention

Authors: Zhongyi Zhang, Jie Zhang, Wenbo Zhou, Xinghui Zhou, Qing Guo, Weiming Zhang, Tianwei Zhang, Nenghai Yu

Abstract: Face-swapping techniques have advanced rapidly with the evolution of deep learning, leading to widespread use and growing concerns about potential misuse, especially in cases of fraud. While many efforts have focused on detecting swapped face images or videos, these methods are insufficient for tracing the malicious users behind fraudulent activities. Intrusive watermark-based approaches also fail to trace unmarked identities, limiting their practical utility. To address these challenges, we introduce FaceTracer, the first non-intrusive framework specifically designed to trace the identity of the source person from swapped face images or videos. Specifically, FaceTracer leverages a disentanglement module that effectively suppresses identity information related to the target person while isolating the identity features of the source person. This allows us to extract robust identity information that can directly link the swapped face back to the original individual, aiding in uncovering the actors behind fraudulent activities. Extensive experiments demonstrate FaceTracer's effectiveness across various face-swapping techniques, successfully identifying the source person in swapped content and enabling the tracing of malicious actors involved in fraudulent activities. Additionally, FaceTracer shows strong transferability to unseen face-swapping methods including commercial applications and robustness against transmission distortions and adaptive attacks.

Submitted 10 December, 2024; originally announced December 2024.

Comments: 17 pages, 18 figures, under review

arXiv:2412.07575 [pdf, other]

Defending Against Neural Network Model Inversion Attacks via Data Poisoning

Authors: Shuai Zhou, Dayong Ye, Tianqing Zhu, Wanlei Zhou

Abstract: Model inversion attacks pose a significant privacy threat to machine learning models by reconstructing sensitive data from their outputs. While various defenses have been proposed to counteract these attacks, they often come at the cost of the classifier's utility, thus creating a challenging trade-off between privacy protection and model utility. Moreover, most existing defenses require retraining the classifier for enhanced robustness, which is impractical for large-scale, well-established models. This paper introduces a novel defense mechanism to better balance privacy and utility, particularly against adversaries who employ a machine learning model (i.e., inversion model) to reconstruct private data. Drawing inspiration from data poisoning attacks, which can compromise the performance of machine learning models, we propose a strategy that leverages data poisoning to contaminate the training data of inversion models, thereby preventing model inversion attacks. Two defense methods are presented. The first, termed label-preserving poisoning attacks for all output vectors (LPA), involves subtle perturbations to all output vectors while preserving their labels. Our findings demonstrate that these minor perturbations, introduced through a data poisoning approach, significantly increase the difficulty of data reconstruction without compromising the utility of the classifier.
Subsequently, we introduce a second method, label-flipping poisoning for partial output vectors (LFP), which selectively perturbs a small subset of output vectors and alters their labels during the process. Empirical results indicate that LPA is notably effective, outperforming the current state-of-the-art defenses. Our data poisoning-based defense provides a new retraining-free defense paradigm that preserves the victim classifier's utility.

Submitted 10 December, 2024; originally announced December 2024.

arXiv:2412.06521 [pdf]

Ancient DNA from 120-Million-Year-Old Lycoptera Fossils Reveals Evolutionary Insights

Authors: Wan-Qian Zhao, Zhan-Yong Guo, Zeng-Yuan Tian, Tong-Fu Su, Gang-Qiang Cao, Zi-Xin Qi, Tian-Cang Qin, Wei Zhou, Jin-Yu Yang, Ming-Jie Chen, Xin-Ge Zhang, Chun-Yan Zhou, Chuan-Jia Zhu, Meng-Fei Tang, Di Wu, Mei-Rong Song, Yu-Qi Guo, Li-You Qiu, Fei Liang, Mei-Jun Li, Jun-Hui Geng, Li-Juan Zhao, Shu-Jie Zhang

Abstract: High quality ancient DNA (aDNA) is essential for molecular paleontology. Due to DNA degradation and contamination by environmental DNA (eDNA), current research is limited to fossils less than 1 million years old. The study successfully extracted DNA from Lycoptera davidi fossils from the Early Cretaceous period, dating 120 million years ago. Using high-throughput sequencing, 1,258,901 DNA sequences were obtained. We established a rigorous protocol known as the mega screen method. Using this method, we identified 243 original in situ DNA (oriDNA) sequences, likely from the Lycoptera genome. These sequences have an average length of over 100 base pairs and show no signs of deamination. Additionally, 10 transposase coding sequences were discovered, shedding light on a unique self-renewal mechanism in the genome. This study provides valuable DNA data for understanding ancient fish evolution and advances paleontological research.

Submitted 9 December, 2024; originally announced December 2024.

Comments: 14 pages, 3 figures

arXiv:2412.05830 [pdf, other]

Large Language Models Merging for Enhancing the Link Stealing Attack on Graph Neural Networks

Authors: Faqian Guan, Tianqing Zhu, Wenhan Chang, Wei Ren, Wanlei Zhou

Abstract: Graph Neural Networks (GNNs), specifically designed to process graph data, have achieved remarkable success in various applications. Link stealing attacks on graph data pose a significant privacy threat, as attackers aim to extract sensitive relationships between nodes (entities), potentially leading to academic misconduct, fraudulent transactions, or other malicious activities. Previous studies have primarily focused on single datasets and did not explore cross-dataset attacks, let alone attacks that leverage the combined knowledge of multiple attackers. However, we find that an attacker can combine the data knowledge of multiple attackers to create a more effective attack model, which can be referred to as a cross-dataset attack. Moreover, if knowledge can be extracted with the help of Large Language Models (LLMs), the attack capability will be more significant. In this paper, we propose a novel link stealing attack method that takes advantage of cross-dataset attacks and LLMs. The LLM is applied to process datasets with different data structures in cross-dataset attacks. Each attacker fine-tunes the LLM on their specific dataset to generate a tailored attack model. We then introduce a novel model merging method to integrate the parameters of these attacker-specific models effectively. The result is a merged attack model with superior generalization capabilities, enabling effective attacks not only on the attackers' datasets but also on previously unseen (out-of-domain) datasets.
We conducted extensive experiments on four datasets to demonstrate the effectiveness of our method. Additional experiments with three different GNN and LLM architectures further illustrate the generality of our approach.

Submitted 8 December, 2024; originally announced December 2024.

Comments: Link Stealing Attacks, Large Language Models, Graph Neural Networks, Privacy Attacks, Model Merging

arXiv:2412.04606 [pdf, other]

Semantic Consistency-Based Uncertainty Quantification for Factuality in Radiology Report Generation

Authors: Chenyu Wang, Weichao Zhou, Shantanu Ghosh, Kayhan Batmanghelich, Wenchao Li

Abstract: Radiology report generation (RRG) has shown great potential in assisting radiologists by automating the labor-intensive task of report writing. While recent advancements have improved the quality and coherence of generated reports, ensuring their factual correctness remains a critical challenge. Although generative medical Vision Large Language Models (VLLMs) have been proposed to address this issue, these models are prone to hallucinations and can produce inaccurate diagnostic information. To address these concerns, we introduce a novel Semantic Consistency-Based Uncertainty Quantification framework that provides both report-level and sentence-level uncertainties. Unlike existing approaches, our method does not require modifications to the underlying model or access to its inner state, such as output token logits, thus serving as a plug-and-play module that can be seamlessly integrated with state-of-the-art models. Extensive experiments demonstrate the efficacy of our method in detecting hallucinations and enhancing the factual accuracy of automatically generated radiology reports. By abstaining from high-uncertainty reports, our approach improves factuality scores by 10%, achieved by rejecting 20% of reports using the Radialog model on the MIMIC-CXR dataset. Furthermore, sentence-level uncertainty flags the lowest-precision sentence in each report with an 82.9% success rate. Our implementation is open-source and available at https://github.com/BU-DEPEND-Lab/SCUQ-RRG.

Submitted 16 March, 2025; v1 submitted 5 December, 2024; originally announced December 2024.
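
The abstention step described in the abstract (reject the most uncertain reports, then score the rest) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the function name `abstain_and_score`, the use of a plain mean as the aggregate, and the toy numbers are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def abstain_and_score(
    uncertainty: Sequence[float],
    factuality: Sequence[float],
    reject_fraction: float,
) -> float:
    """Mean factuality after dropping the most uncertain reports.

    Assumes higher `uncertainty` means less trustworthy and that the
    aggregate factuality is a simple average (an assumption of this sketch).
    """
    if len(uncertainty) != len(factuality):
        raise ValueError("uncertainty and factuality must have equal length")
    if not 0.0 <= reject_fraction < 1.0:
        raise ValueError("reject_fraction must be in [0, 1)")
    n = len(uncertainty)
    n_reject = int(n * reject_fraction)
    # Sort report indices from most to least confident, keep the confident ones.
    order = sorted(range(n), key=lambda i: uncertainty[i])
    kept = order[: n - n_reject]
    if not kept:
        raise ValueError("no reports left after abstention")
    return sum(factuality[i] for i in kept) / len(kept)


# Hypothetical toy scores for ten reports.
toy_uncertainty = [0.10, 0.90, 0.20, 0.80, 0.30, 0.40, 0.50, 0.60, 0.70, 0.05]
toy_factuality = [0.9, 0.2, 0.8, 0.3, 0.7, 0.7, 0.6, 0.6, 0.5, 0.9]

baseline = abstain_and_score(toy_uncertainty, toy_factuality, 0.0)
selective = abstain_and_score(toy_uncertainty, toy_factuality, 0.2)
```

With these toy numbers the two most uncertain reports also have the lowest factuality, so `selective` exceeds `baseline`; real gains depend on how well the uncertainty estimate tracks factual errors.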

arXiv:2412.03148 [pdf, other]

Fine-Grained Behavior Simulation with Role-Playing Large Language Model on Social Media

Authors: Kun Li, Chenwei Dai, Wei Zhou, Songlin Hu

Abstract: Large language models (LLMs) have demonstrated impressive capabilities in role-playing tasks. However, there is limited research on whether LLMs can accurately simulate user behavior in real-world scenarios, such as social media. This requires models to effectively analyze a user's history and simulate their role. In this paper, we introduce FineRob, a novel fine-grained behavior simulation dataset. We collect the complete behavioral history of 1,866 distinct users across three social media platforms. Each behavior is decomposed into three fine-grained elements: object, type, and content, resulting in 78.6k QA records. Based on FineRob, we identify two dominant reasoning patterns in LLMs' behavior simulation processes and propose the OM-CoT fine-tuning method to enhance the capability. Through comprehensive experiments, we conduct an in-depth analysis of key factors of behavior simulation and also demonstrate the effectiveness of the OM-CoT approach. Code and dataset are available at https://github.com/linkseed18612254945/FineRob.

Submitted 4 December, 2024; originally announced December 2024.

arXiv:2412.02721 [pdf, other]

Advancing Tritium Self-Sufficiency in Fusion Power Plants: Insights from the BABY Experiment

Authors: Remi Delaporte-Mathurin, Nikola Goles, John Ball, Collin Dunn, Emily Edwards, Sara Ferry, Edward Lamere, Andrew Lanzrath, Rick Leccacorvi, Samuele Meschini, Ethan Peterson, Stefano Segantin, Rui Vieira, Dennis Whyte, Weiyue Zhou, Kevin Woller

Abstract: In the pursuit of fusion power, achieving tritium self-sufficiency stands as a pivotal challenge. Tritium breeding within molten salts is a critical aspect of next-generation fusion reactors, yet experimental measurements of the tritium breeding ratio (TBR) have remained elusive. Here we present the results of the BABY experiment, which represents a pioneering effort in tritium research by utilizing high-energy (14 MeV) neutron irradiation of molten salts, a departure from conventional low-energy neutron approaches. Using a small-scale (100 mL) molten salt tritium breeding setup, we not only simulated, but also directly measured a TBR. This innovative approach provides crucial experimental validation, offering insights unattainable through simulation alone. Moreover, our findings reveal a surprising outcome: tritium was predominantly collected as HT, contrary to the expected TF. This underscores the complexity of tritium behavior in molten salts, highlighting the need for further investigation. This work lays the foundation for a more sophisticated experimental setup, including increasing the volume of the breeder, enhancing neutron detection, and refining tritium collection systems. Such improvements are crucial for advancing our understanding of fusion reactor feasibility and paving the way for future experiments.

Submitted 2 December, 2024; originally announced December 2024.

arXiv:2412.02685 [pdf, other]

T-REG: Preference Optimization with Token-Level Reward Regularization

Authors: Wenxuan Zhou, Shujian Zhang, Lingxiao Zhao, Tao Meng

Abstract: Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models (LLMs) with human values. Traditionally, RLHF involves generating responses to a query and using a reward model to assign a reward to the entire response. However, this approach faces challenges due to its reliance on a single, sparse reward, which makes it challenging for the model to identify which parts of the sequence contribute most significantly to the final reward. Recent methods have attempted to address this limitation by introducing token-level rewards. However, these methods often rely on either a trained credit assignment model or AI annotators, raising concerns about the quality and reliability of the rewards. In this paper, we propose token-level reward regularization (T-REG), a novel approach that leverages both sequence-level and token-level rewards for preference optimization. Harnessing the self-refinement capabilities of LLMs, our method uses contrastive prompting to enable LLMs to self-generate token-level rewards. These self-generated rewards then act as reward regularization, guiding the model to more effectively distribute sequence-level rewards across tokens. This facilitates better token-level credit assignment and enhances alignment performance. Experiments on the instruction following benchmarks, including Alpaca Eval 2 and Arena-Hard, show that our method consistently outperforms baseline methods by up to 3.8% and 4.4%, respectively. We will release the code and models at https://github.com/wzhouad/T-REG.

Submitted 3 December, 2024; originally announced December 2024.

arXiv:2412.00348 [pdf, other]

Vision Technologies with Applications in Traffic Surveillance Systems: A Holistic Survey

Authors: Wei Zhou, Lei Zhao, Runyu Zhang, Yifan Cui, Hongpu Huang, Kun Qie, Chen Wang

Abstract: Traffic Surveillance Systems (TSS) have become increasingly crucial in modern intelligent transportation systems, with vision-based technologies playing a central role for scene perception and understanding. While existing surveys typically focus on isolated aspects of TSS, a comprehensive analysis bridging low-level and high-level perception tasks, particularly considering emerging technologies, remains lacking. This paper presents a systematic review of vision-based technologies in TSS, examining both low-level perception tasks (object detection, classification, and tracking) and high-level perception applications (parameter estimation, anomaly detection, and behavior understanding). Specifically, we first provide a detailed methodological categorization and comprehensive performance evaluation for each task. Our investigation reveals five fundamental limitations in current TSS: perceptual data degradation in complex scenarios, data-driven learning constraints, semantic understanding gaps, sensing coverage limitations and computational resource demands. To address these challenges, we systematically analyze five categories of potential solutions: advanced perception enhancement, efficient learning paradigms, knowledge-enhanced understanding, cooperative sensing frameworks and efficient computing frameworks. Furthermore, we evaluate the transformative potential of foundation models in TSS, demonstrating their unique capabilities in zero-shot learning, semantic understanding, and scene generation.
This review provides a unified framework bridging low-level and high-level perception tasks, systematically analyzes current limitations and solutions, and presents a structured roadmap for integrating emerging technologies, particularly foundation models, to enhance TSS capabilities.

Submitted 29 November, 2024; originally announced December 2024.

arXiv:2411.19445 [pdf]

Achromatic single-layer hologram

Authors: Zhi Li, Wenhui Zhou, Xin Yuan, Weiwei Cai, Dongdong Teng, Qiang Song, Huigao Duan

Abstract: Phase retrieval is a fundamental technique of advanced optical technologies, enabling precise control over wavefront properties. A persistent challenge in diffractive optical element (DOE) design is that a single hologram typically operates within a single wavelength or color channel, limiting it to monochromatic image generation. This limitation in channel capacity significantly restricts the applicability of DOE in optical applications. In this study, we propose a design strategy for full-color, single-layer hologram based on a variable-scale diffraction model. By imposing strict constraints in Fourier domain and reducing depth of focus (DOF), we achieve the simultaneous encryption and storage of red, green, and blue channel information within a single achromatic hologram. This strategy facilitates color separation in large-depth 3D holography and enables achromatic full-color image displays. We demonstrated full-color holographic video playback at a full refresh rate of 60 Hz, achieving a temporal resolution three times greater than that of existing methods. Furthermore, we successfully fabricated achromatic, twin-image-free, full-color binary pure-phase DOEs at low cost. This achromatic strategy addresses the demands across various fields in optics, including high-refresh-rate full-color displays, high-density optical information storage, advanced optical security, high-reusability holographic metasurface optical element, and high-performance achromatic metalenses.

Submitted 28 November, 2024; originally announced November 2024.

arXiv:2411.19194 [pdf]

Influencing Factors of the FLASH Effect: Unveiling the Importance of Free Radicals

Authors: Yan Zhang, Chenyang Huang, Ankang Hu, Yucheng Wang, Wanyi Zhou, Jiaqi Qiu, Jian Wang, Qibin Fu, Tuchen Huang, Hao Zha, Wei Wang, Xiaowu Deng, Junli Li

Abstract: Purpose: Our aim was to elucidate the critical factors responsible for inducing the FLASH effect, focusing on the role of free radicals through simulation and experimental approaches. Methods and Materials: The whole abdomen of C57BL/6 mice was irradiated with a 6 MeV electron beam. The endpoint was acute intestinal toxicity quantified by histological score. Total doses ranging from 6 to 15 Gy were evaluated. The impact of the mean dose rate (MDR) was assessed in the range of 40 to 900 Gy/s. Doses per pulse (DPP) of 0.5 Gy and 3 Gy were compared. The recombination of peroxyl radicals was simulated. Further comparisons were conducted by incorporating the antioxidant amifostine. Results: When varying total doses with a constant MDR of 900 Gy/s, the FLASH effect was not observed until the dose reached 15 Gy. For a total dose of 15 Gy and varying MDR, the FLASH effect was observed only when MDR reached 100 Gy/s. For a dose of 15 Gy and an MDR of 150 Gy/s, no significant difference in biological effect was observed between low DPP and high DPP. The simulation results indicated that the fraction of peroxyl radical recombination remained nearly zero at conventional dose rates. For FLASH irradiation, the recombination fraction increased linearly with the dose. Notably, the dose delivery time corresponding to a 50% change in the recombination fraction was approximately 300 ms. The addition of amifostine effectively eliminated the difference between the FLASH group and the CONV group.
Conclusions: The critical requirement for observing the sparing effect at the biological endpoint is the administration of an adequate dose within the time window of the radical reaction. Additionally, the important role of free radicals was verified after introducing antioxidants, suggesting that the generation and recombination of free radicals are pivotal factors influencing the FLASH sparing effect.

Submitted 28 November, 2024; originally announced November 2024.

Comments: 15 pages, 4 figures, 1 table

arXiv:2411.19062 [pdf]

Unveiling the anisotropy of linear and nonlinear charge-spin conversion in Weyl semimetal TaIrTe4

Authors: Tao Tang, Mengzhou Li, Bin Lao, Xuan Zheng, Wei Zhou, Xiaofeng Xu, Jie Pang, You-guo Shi, Run-Wei Li, Zhiming Wang

Abstract: In Weyl semimetals, the nonlinear planar Hall effect (NPHE) and spin-orbit torque (SOT) are prominent manifestations of nonlinear and linear charge-spin conversion, respectively. However, simultaneous investigations of these phenomena within a single material system are scarce, limiting our understanding of their intrinsic connection and underlying mechanisms. Here, we report the first simultaneous observation of NPHE and SOT in a TaIrTe4/Py heterostructure. By employing harmonic Hall measurements and developing a magnetic field-dependent method, we successfully separated the contributions from NPHE, field-like SOT, and damping-like SOT, enabling accurate characterization of both linear and nonlinear charge-spin conversion properties. Our experiments revealed significant anisotropy along the [100] and [010] crystallographic directions of TaIrTe4, with stronger nonlinear responses and field-like SOT along the [100] direction, and larger damping-like SOT along the [010] direction. The distinct directional dependence of these phenomena provides new insights into the interplay between surface and bulk contributions to charge-spin conversion in Weyl semimetals. These findings enhance our understanding of anisotropic charge-spin conversion mechanisms in Weyl semimetals, which may inform future research and development of spintronic devices based on topological materials.

Submitted 28 November, 2024; originally announced November 2024.

arXiv:2411.19040 [pdf, other]

doi 10.1051/0004-6361/202452146

The binary Yarkovsky effect on the primary asteroid with applications to singly synchronous binary asteroids

Authors: Wen-Han Zhou

Abstract: The binary Yarkovsky effect on the secondary asteroid (BYS) was recently discovered to influence binary asteroid systems by pushing the secondary asteroid toward a synchronous orbit on a short timescale. However, the binary Yarkovsky effect on the primary (BYP) remains less understood, partly due to non-linear effects from partial eclipses, but could have significant implications for singly synchronous binaries. In this work, we studied the BYP effect by numerical methods and estimated its induced orbital drifting rates for real binary asteroids. We find an empirical modified solution to estimate the effective BYP: the traditional BYP formula multiplied by $(r_s / r_p)^{\alpha-1}$. We confirm that the BYP pushes the primary towards a synchronous orbit where its spin equals the mean motion. The parameter $\alpha$ is insensitive to the ratio of the spin rate to the mean motion and decreases slightly with increasing thermal inertia. For small binary systems with a typical thermal inertia of 200 tiu, $\alpha$ is approximately 1.7. The BYP is found to affect the mutual orbit of singly synchronous binaries with a timescale typically an order of magnitude longer than that of the BYS. Drift rates induced by the BYP for known small binary asteroids (primary radius < 1 km) range from -0.001 to -1 cm yr$^{-1}$. A comparative analysis with observed orbital drift rates shows agreement for pre-impact Didymos and 1996 FG$_3$ but discrepancies for 2001 SL$_9$ and 1999 KW$_4$, suggesting complex dynamics in these systems involving the BYP, the binary Yarkovsky-O'Keefe-Radzievskii-Paddack (BYORP) effect, and tides. The BYP is changing the mutual orbits of most discovered binary asteroids. We suggest that the BYP should be considered along with BYORP and tidal effects when studying binary systems' long-term dynamics.

Submitted 28 November, 2024; originally announced November 2024.

Comments: 7 pages, 5 figures. Published in A&A Letters

arXiv:2411.18197 [pdf, other]

Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Authors: Zhiyang Guo, Jinxu Xiang, Kai Ma, Wengang Zhou, Houqiang Li, Ran Zhang

Abstract: 3D characters are essential to modern creative industries, but making them animatable often demands extensive manual work in tasks like rigging and skinning. Existing automatic rigging tools face several limitations, including the necessity for manual annotations, rigid skeleton topologies, and limited generalization across diverse shapes and poses. An alternative approach is to generate animatable avatars pre-bound to a rigged template mesh. However, this method often lacks flexibility and is typically limited to realistic human shapes. To address these issues, we present Make-It-Animatable, a novel data-driven method to make any 3D humanoid model ready for character animation in less than one second, regardless of its shapes and poses. Our unified framework generates high-quality blend weights, bones, and pose transformations. By incorporating a particle-based shape autoencoder, our approach supports various 3D representations, including meshes and 3D Gaussian splats. Additionally, we employ a coarse-to-fine representation and a structure-aware modeling strategy to ensure both accuracy and robustness, even for characters with non-standard skeleton structures. We conducted extensive experiments to validate our framework's effectiveness. Compared to existing methods, our approach demonstrates significant improvements in both quality and speed. More demos and code are available at https://jasongzy.github.io/Make-It-Animatable/.

Submitted 11 March, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

Comments: CVPR 2025. Project page: https://jasongzy.github.io/Make-It-Animatable/

arXiv:2411.15714 [pdf, other]

ROOT: VLM based System for Indoor Scene Understanding and Beyond

Authors: Yonghui Wang, Shi-Yong Chen, Zhenxing Zhou, Siyi Li, Haoran Li, Wengang Zhou, Houqiang Li

Abstract: Recently, Vision Language Models (VLMs) have experienced significant advancements, yet these models still face challenges in spatial hierarchical reasoning within indoor scenes. In this study, we introduce ROOT, a VLM-based system designed to enhance the analysis of indoor scenes. Specifically, we first develop an iterative object perception algorithm using GPT-4V to detect object entities within indoor scenes. This is followed by employing vision foundation models to acquire additional meta-information about the scene, such as bounding boxes. Building on this foundational data, we propose a specialized VLM, SceneVLM, which is capable of generating spatial hierarchical scene graphs and providing distance information for objects within indoor environments. This information enhances our understanding of the spatial arrangement of indoor scenes. To train our SceneVLM, we collect over 610,000 images from various public indoor datasets and implement a scene data generation pipeline with a semi-automated technique to establish relationships and estimate distances among indoor objects. Using this enriched data, we explore various training recipes and finalize SceneVLM. Our experiments demonstrate that ROOT facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI. The code will be released at https://github.com/harrytea/ROOT.

Submitted 23 November, 2024; originally announced November 2024.

Showing 201–250 of 1,936 results for author: Zhou, W