Search | arXiv e-print repository

EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model

Authors: Ruixiao Dong, Zhendong Wang, Keli Liu, Li Li, Ying Chen, Kai Li, Daowen Li, Houqiang Li

Abstract: Subject-driven generation is a critical task in creative AI; yet current state-of-the-art methods present a stark trade-off. They either rely on computationally expensive, per-subject fine-tuning, sacrificing efficiency and zero-shot capability, or employ feed-forward architectures built on diffusion models, which are inherently plagued by slow inference speeds. Visual Auto-Regressive (VAR) models are renowned for their rapid sampling speeds and strong generative quality, making them an ideal yet underexplored foundation for resolving this tension. To bridge this gap, we introduce EchoGen, a pioneering framework that empowers VAR models with subject-driven generation capabilities. The core design of EchoGen is an effective dual-path injection strategy that disentangles a subject's high-level semantic identity from its low-level fine-grained details, enabling enhanced controllability and fidelity. We employ a semantic encoder to extract the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition. Concurrently, a content encoder captures intricate visual details, which are integrated via a multi-modal attention mechanism to ensure high-fidelity texture and structural preservation. To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models. Both quantitative and qualitative results substantiate our design, demonstrating that EchoGen achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods with significantly lower sampling latency. Code and models will be released soon.

Submitted 30 September, 2025; originally announced September 2025.

arXiv:2509.23687 [pdf, ps, other]

Joint Hybrid Beamforming and Artificial Noise Design for Secure Multi-UAV ISAC Networks

Authors: Runze Dong, Buhong Wang, Cunqian Feng, Jiang Weng, Chen Han, Jiwei Tian

Abstract: Integrated sensing and communication (ISAC) emerges as a key enabler for next-generation applications such as smart cities and autonomous systems. Its integration with unmanned aerial vehicles (UAVs) unlocks new potential for reliable communication and precise sensing in dynamic aerial environments. However, existing research predominantly treats UAVs as aerial base stations, overlooking their role as ISAC users, and fails to leverage large-scale antenna arrays at terrestrial base stations to enhance security and spectral efficiency. This paper proposes a secure and spectrally efficient ISAC framework for multi-UAV networks, and a two-stage optimization approach is developed to jointly design hybrid beamforming (HBF), artificial noise (AN) injection, and UAV trajectories. Aiming at maximizing the sum secrecy rate, the first stage employs Proximal Policy Optimization (PPO) to optimize digital beamformers and trajectories, and the second stage decomposes the digital solution into analog and digital components via low-complexity matrix factorization. Simulation results demonstrate the effectiveness of the proposed framework compared to benchmark schemes.

Submitted 28 September, 2025; originally announced September 2025.

arXiv:2509.20447 [pdf, ps, other]

Neural Networks as Surrogate Solvers for Time-Dependent Accretion Disk Dynamics

Authors: Shunyuan Mao, Weiqi Wang, Sifan Wang, Ruobing Dong, Lu Lu, Kwang Moo Yi, Paris Perdikaris, Andrea Isella, Sébastien Fabbro, Lile Wang

Abstract: Accretion disks are ubiquitous in astrophysics, appearing in diverse environments from planet-forming systems to X-ray binaries and active galactic nuclei. Traditionally, modeling their dynamics requires computationally intensive (magneto)hydrodynamic simulations. Recently, Physics-Informed Neural Networks (PINNs) have emerged as a promising alternative. This approach trains neural networks directly on physical laws without requiring data. For the first time, we demonstrate PINNs for solving the two-dimensional, time-dependent hydrodynamics of non-self-gravitating accretion disks. Our models provide solutions at arbitrary times and locations within the training domain, and successfully reproduce key physical phenomena, including the excitation and propagation of spiral density waves and gap formation from disk-companion interactions. Notably, the boundary-free approach enabled by PINNs naturally eliminates the spurious wave reflections at disk edges, which are challenging to suppress in numerical simulations. These results highlight how advanced machine learning techniques can enable physics-driven, data-free modeling of complex astrophysical systems, potentially offering an alternative to traditional numerical simulations in the future.

Submitted 24 September, 2025; originally announced September 2025.

Comments: Astrophysical Journal Letters accepted; associated animations are available at https://doi.org/10.6084/m9.figshare.30192904

arXiv:2509.17359 [pdf, ps, other]

MLLM-Driven Semantic Identifier Generation for Generative Cross-Modal Retrieval

Authors: Tianyuan Li, Lei Wang, Ahtamjan Ahmat, Yating Yang, Bo Ma, Rui Dong, Bangju Han

Abstract: Generative cross-modal retrieval, which treats retrieval as a generation task, has emerged as a promising direction with the rise of Multimodal Large Language Models (MLLMs). In this setting, the model responds to a text query by generating an identifier corresponding to the target image. However, existing methods typically rely on manually crafted string IDs, clustering-based labels, or atomic identifiers requiring vocabulary expansion, all of which face challenges in semantic alignment or scalability. To address these limitations, we propose a vocabulary-efficient identifier generation framework that prompts MLLMs to generate Structured Semantic Identifiers from image-caption pairs. These identifiers are composed of concept-level tokens such as objects and actions, naturally aligning with the model's generation space without modifying the tokenizer. Additionally, we introduce a Rationale-Guided Supervision Strategy, which prompts the model to produce a one-sentence explanation alongside each identifier; this explanation serves as an auxiliary supervision signal that improves semantic grounding and reduces hallucinations during training.

Submitted 22 September, 2025; originally announced September 2025.

arXiv:2509.15404 [pdf, ps, other]

Trust-Aware Embodied Bayesian Persuasion for Mixed-Autonomy

Authors: Shaoting Peng, Katherine Driggs-Campbell, Roy Dong

Abstract: Safe and efficient interaction between autonomous vehicles (AVs) and human-driven vehicles (HVs) is a critical challenge for future transportation systems. While game-theoretic models capture how AVs influence HVs, they often suffer from a long-term decay of influence and can be perceived as manipulative, eroding the human's trust. This can paradoxically lead to riskier human driving behavior over repeated interactions. In this paper, we address this challenge by proposing the Trust-Aware Embodied Bayesian Persuasion (TA-EBP) framework. Our work makes three key contributions: First, we apply Bayesian persuasion to model communication at traffic intersections, offering a transparent alternative to traditional game-theoretic models. Second, we introduce a trust parameter to the persuasion framework, deriving a theorem for the minimum trust level required for influence. Finally, we ground the abstract signals of Bayesian persuasion theory into a continuous, physically meaningful action space, deriving a second theorem for the optimal signal magnitude, realized as an AV's forward nudge. Additionally, we validate our framework in a mixed-autonomy traffic simulation, demonstrating that TA-EBP successfully persuades HVs to drive more cautiously, eliminating collisions and improving traffic flow compared to baselines that either ignore trust or lack communication. Our work provides a transparent and non-strategic framework for influence in human-robot interaction, enhancing both safety and efficiency.

Submitted 18 September, 2025; originally announced September 2025.

arXiv:2509.10873 [pdf, ps, other]

Automated Radiology Report Generation Based on Topic-Keyword Semantic Guidance

Authors: Jing Xiao, Hongfei Liu, Ruiqi Dong, Jimin Liu, Haoyong Yu

Abstract: Automated radiology report generation is essential in clinical practice. However, diagnosing radiological images typically takes physicians 5-10 minutes, resulting in a waste of valuable healthcare resources. Existing studies have not fully leveraged knowledge from historical radiology reports, lacking sufficient and accurate prior information. To address this, we propose a Topic-Keyword Semantic Guidance (TKSG) framework. This framework uses BiomedCLIP to accurately retrieve historical similar cases. Supported by multimodal, TKSG accurately detects topic words (disease classifications) and keywords (common symptoms) in diagnoses. The probabilities of topic terms are aggregated into a topic vector, serving as global information to guide the entire decoding process. Additionally, a semantic-guided attention module is designed to refine local decoding with keyword content, ensuring report accuracy and relevance. Experimental results show that our model achieves excellent performance on both IU X-Ray and MIMIC-CXR datasets. The code is available at https://github.com/SCNU203/TKSG.

Submitted 13 September, 2025; originally announced September 2025.

arXiv:2509.08418 [pdf, ps, other]

Facet: highly efficient E(3)-equivariant networks for interatomic potentials

Authors: Nicholas Miklaucic, Lai Wei, Rongzhi Dong, Nihang Fu, Sadman Sadeed Omee, Qingyang Li, Sourin Dey, Victor Fung, Jianjun Hu

Abstract: Computational materials discovery is limited by the high cost of first-principles calculations. Machine learning (ML) potentials that predict energies from crystal structures are promising, but existing methods face computational bottlenecks. Steerable graph neural networks (GNNs) encode geometry with spherical harmonics, respecting atomic symmetries -- permutation, rotation, and translation -- for physically realistic predictions. Yet maintaining equivariance is difficult: activation functions must be modified, and each layer must handle multiple data types for different harmonic orders. We present Facet, a GNN architecture for efficient ML potentials, developed through systematic analysis of steerable GNNs. Our innovations include replacing expensive multi-layer perceptrons (MLPs) for interatomic distances with splines, which match performance while cutting computational and memory demands. We also introduce a general-purpose equivariant layer that mixes node information via spherical grid projection followed by standard MLPs -- faster than tensor products and more expressive than linear or gate layers. On the MPTrj dataset, Facet matches leading models with far fewer parameters and under 10% of their training compute. On a crystal relaxation task, it runs twice as fast as MACE models. We further show SevenNet-0's parameters can be reduced by over 25% with no accuracy loss. These techniques enable more than 10x faster training of large-scale foundation models for ML potentials, potentially reshaping computational materials discovery.

Submitted 10 September, 2025; originally announced September 2025.

arXiv:2509.08199 [pdf, ps, other]

Algorithmic Tradeoffs, Applied NLP, and the State-of-the-Art Fallacy

Authors: AJ Alvero, Ruohong Dong, Klint Kanopka, David Lang

Abstract: Computational sociology is growing in popularity, yet the analytic tools employed differ widely in power, transparency, and interpretability. In computer science, methods gain popularity after surpassing benchmarks of predictive accuracy, becoming the "state of the art." Computer scientists favor novelty and innovation for different reasons, but prioritizing technical prestige over methodological fit could unintentionally limit the scope of sociological inquiry. To illustrate, we focus on computational text analysis and revisit a prior study of college admissions essays, comparing analyses with both older and newer methods. These methods vary in flexibility and opacity, allowing us to compare performance across distinct methodological regimes. We find that newer techniques did not outperform prior results in meaningful ways. We also find that using the current state of the art, generative AI and large language models, could introduce bias and confounding that is difficult to extricate. We therefore argue that sociological inquiry benefits from methodological pluralism that aligns analytic choices with theoretical and empirical questions. While we frame this sociologically, scholars in other disciplines may confront what we call the "state-of-the-art fallacy", the belief that the tool computer scientists deem to be the best will work across topics, domains, and questions.

Submitted 9 September, 2025; originally announced September 2025.

arXiv:2509.07594 [pdf, ps, other]

doi 10.1145/3726302.3730188

ELEC: Efficient Large Language Model-Empowered Click-Through Rate Prediction

Authors: Rui Dong, Wentao Ouyang, Xiangzheng Liu

Abstract: Click-through rate (CTR) prediction plays an important role in online advertising systems. On the one hand, traditional CTR prediction models capture the collaborative signals in tabular data via feature interaction modeling, but they lose semantics in text. On the other hand, Large Language Models (LLMs) excel in understanding the context and meaning behind text, but they face challenges in capturing collaborative signals and they have long inference latency. In this paper, we aim to leverage the benefits of both types of models and pursue collaboration, semantics and efficiency. We present ELEC, which is an Efficient LLM-Empowered CTR prediction framework. We first adapt an LLM for the CTR prediction task. In order to leverage the ability of the LLM but simultaneously keep efficiency, we utilize a pseudo-siamese network which contains a gain network and a vanilla network. We inject the high-level representation vector generated by the LLM into a collaborative CTR model to form the gain network such that it can take advantage of both tabular modeling and textual modeling. However, its reliance on the LLM limits its efficiency. We then distill the knowledge from the gain network to the vanilla network on both the score level and the representation level, such that the vanilla network takes only tabular data as input, but can still achieve performance comparable to the gain network. Our approach is model-agnostic. It allows for the integration with various existing LLMs and collaborative CTR models. Experiments on real-world datasets demonstrate the effectiveness and efficiency of ELEC for CTR prediction.

Submitted 9 September, 2025; originally announced September 2025.

Comments: SIGIR 2025

arXiv:2509.06976 [pdf]

A Knowledge-Guided Cross-Modal Feature Fusion Model for Local Traffic Demand Prediction

Authors: Lingyu Zhang, Pengfei Xu, Guobin Wu, Jian Liang, Ruiyang Dong, Yunhai Wang, Xuan Song

Abstract: Traffic demand prediction plays a critical role in intelligent transportation systems. Existing traffic prediction models primarily rely on temporal traffic data, with limited efforts incorporating human knowledge and experience for urban traffic demand forecasting. However, in real-world scenarios, traffic knowledge and experience derived from human daily life significantly influence precise traffic prediction. Such knowledge and experiences can guide the model in uncovering latent patterns within traffic data, thereby enhancing the accuracy and robustness of predictions. To this end, this paper proposes integrating structured temporal traffic data with textual data representing human knowledge and experience, resulting in a novel knowledge-guided cross-modal feature representation learning (KGCM) model for traffic demand prediction. Based on regional transportation characteristics, we construct a prior knowledge dataset using a large language model combined with manual authoring and revision, covering both regional and global knowledge and experiences. The KGCM model then learns multimodal data features through designed local and global adaptive graph networks, as well as a cross-modal feature fusion mechanism. A proposed reasoning-based dynamic update strategy enables dynamic optimization of the graph model's parameters, achieving optimal performance. Experiments on multiple traffic datasets demonstrate that our model accurately predicts future traffic demand and outperforms existing state-of-the-art (SOTA) models.

Submitted 29 August, 2025; originally announced September 2025.

arXiv:2509.06499 [pdf, ps, other]

TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement

Authors: Jibai Lin, Bo Ma, Yating Yang, Xi Zhou, Rong Ma, Turghun Osman, Ahtamjan Ahmat, Rui Dong, Lei Wang

Abstract: Subject-driven image generation (SDIG) aims to manipulate specific subjects within images while adhering to textual instructions, a task crucial for advancing text-to-image diffusion models. SDIG requires reconciling the tension between maintaining subject identity and complying with dynamic edit instructions, a challenge inadequately addressed by existing methods. In this paper, we introduce the Target-Instructed Diffusion Enhancing (TIDE) framework, which resolves this tension through target supervision and preference learning without test-time fine-tuning. TIDE pioneers target-supervised triplet alignment, modelling subject adaptation dynamics using a (reference image, instruction, target images) triplet. This approach leverages the Direct Subject Diffusion (DSD) objective, training the model with paired "winning" (balanced preservation-compliance) and "losing" (distorted) targets, systematically generated and evaluated via quantitative metrics. This enables implicit reward modelling for optimal preservation-compliance balance. Experimental results on standard benchmarks demonstrate TIDE's superior performance in generating subject-faithful outputs while maintaining instruction compliance, outperforming baseline methods across multiple quantitative metrics. TIDE's versatility is further evidenced by its successful application to diverse tasks, including structural-conditioned generation, image-to-image generation, and text-image interpolation. Our code is available at https://github.com/KomJay520/TIDE.

Submitted 18 September, 2025; v1 submitted 8 September, 2025; originally announced September 2025.

arXiv:2508.12546 [pdf, ps, other]

XAMT: Cross-Framework API Matching for Testing Deep Learning Libraries

Authors: Bin Duan, Ruican Dong, Naipeng Dong, Dan Dongseong Kim, Guowei Yang

Abstract: Deep learning powers critical applications such as autonomous driving, healthcare, and finance, where the correctness of underlying libraries is essential. Bugs in widely used deep learning APIs can propagate to downstream systems, causing serious consequences. While existing fuzzing techniques detect bugs through intra-framework testing across hardware backends (CPU vs. GPU), they may miss bugs that manifest identically across backends and thus escape detection under these strategies. To address this problem, we propose XAMT, a cross-framework fuzzing method that tests deep learning libraries by matching and comparing functionally equivalent APIs across different frameworks. XAMT matches APIs using similarity-based rules based on names, descriptions, and parameter structures. It then aligns inputs and applies variance-guided differential testing to detect bugs. We evaluated XAMT on five popular frameworks, including PyTorch, TensorFlow, Keras, Chainer, and JAX. XAMT matched 839 APIs and identified 238 matched API groups, and detected 17 bugs, 12 of which have been confirmed. Our results show that XAMT uncovers bugs undetectable by intra-framework testing, especially those that manifest consistently across backends. XAMT offers a complementary approach to existing methods and offers a new perspective on the testing of deep learning libraries.

Submitted 17 August, 2025; originally announced August 2025.

arXiv:2507.21134 [pdf, ps, other]

TRIDENT: Benchmarking LLM Safety in Finance, Medicine, and Law

Authors: Zheng Hui, Yijiang River Dong, Ehsan Shareghi, Nigel Collier

Abstract: As large language models (LLMs) are increasingly deployed in high-risk domains such as law, finance, and medicine, systematically evaluating their domain-specific safety and compliance becomes critical. While prior work has largely focused on improving LLM performance in these domains, it has often neglected the evaluation of domain-specific safety risks. To bridge this gap, we first define domain-specific safety principles for LLMs based on the AMA Principles of Medical Ethics, the ABA Model Rules of Professional Conduct, and the CFA Institute Code of Ethics. Building on this foundation, we introduce Trident-Bench, a benchmark specifically targeting LLM safety in the legal, financial, and medical domains. We evaluated 19 general-purpose and domain-specialized models on Trident-Bench and show that it effectively reveals key safety gaps -- strong generalist models (e.g., GPT, Gemini) can meet basic expectations, whereas domain-specialized models often struggle with subtle ethical nuances. This highlights an urgent need for finer-grained domain-specific safety improvements. By introducing Trident-Bench, our work provides one of the first systematic resources for studying LLM safety in law and finance, and lays the groundwork for future research aimed at reducing the safety risks of deploying LLMs in professionally regulated fields. Code and benchmark will be released at: https://github.com/zackhuiiiii/TRIDENT

Submitted 22 July, 2025; originally announced July 2025.

arXiv:2507.04447 [pdf, ps, other]

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin

Abstract: Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments in both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and an average length of 4.44 on the CALVIN ABC-D benchmarks.

Submitted 26 August, 2025; v1 submitted 6 July, 2025; originally announced July 2025.

arXiv:2507.02018 [pdf, ps, other]

NGAT: A Node-level Graph Attention Network for Long-term Stock Prediction

Authors: Yingjie Niu, Mingchuan Zhao, Valerio Poti, Ruihai Dong

Abstract: Graph representation learning methods have been widely adopted in financial applications to enhance company representations by leveraging inter-firm relationships. However, current approaches face three key challenges: (1) The advantages of relational information are obscured by limitations in downstream task designs; (2) Existing graph models specifically designed for stock prediction often suffer from excessive complexity and poor generalization; (3) Experience-based construction of corporate relationship graphs lacks effective comparison of different graph structures. To address these limitations, we propose a long-term stock prediction task and develop a Node-level Graph Attention Network (NGAT) specifically tailored for corporate relationship graphs. Furthermore, we experimentally demonstrate the limitations of existing graph comparison methods based on model downstream task performance. Experimental results across two datasets consistently demonstrate the effectiveness of our proposed task and model. The project is publicly available on GitHub to encourage reproducibility and future research.

Submitted 2 July, 2025; originally announced July 2025.

ACM Class: I.2.1

arXiv:2506.04650 [pdf, ps, other]

Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction

Authors: Zesheng Ye, Chengyi Cai, Ruijiang Dong, Jianzhong Qi, Lei Feng, Pin-Yu Chen, Feng Liu

Abstract: As large-scale pre-trained foundation models continue to expand in size and capability, efficiently adapting them to specific downstream tasks has become increasingly critical. Despite substantial progress, existing adaptation approaches have evolved largely in isolation, without a clear understanding of their interrelationships. This survey introduces neural network reprogrammability as a unifying framework that bridges mainstream model adaptation techniques--model reprogramming, prompt tuning, and prompt instruction--previously fragmented research areas that nonetheless converge on a shared principle: repurposing a pre-trained model by manipulating information at the interfaces while keeping the model parameters frozen. These methods exploit neural networks' sensitivity to manipulation on different interfaces, be it through perturbing inputs, inserting tokens into intermediate layers, or providing task-specific examples in context, to redirect model behaviors towards desired outcomes. We then present a taxonomy that categorizes such information manipulation-based adaptation approaches across four key dimensions: manipulation format (fixed or learnable), location (interfaces where manipulations occur), operator (how they are applied), and output alignment requirement (post-processing needed to align outputs with downstream tasks). Notably, this framework applies consistently across data modalities, independent of specific model architectures. Moreover, viewing established techniques like in-context learning and chain-of-thought prompting through this lens reveals both their theoretical connections and practical distinctions. We further analyze remaining technical challenges and ethical considerations, positioning neural network reprogrammability as a fundamental paradigm for efficient model adaptation. We lastly identify promising research directions emerging from this integrative viewpoint.

Submitted 13 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

arXiv:2505.24863 [pdf, ps, other]

AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

Authors: Junyu Zhang, Runpei Dong, Han Wang, Xuying Ning, Haoran Geng, Peihao Li, Xialin He, Yutong Bai, Jitendra Malik, Saurabh Gupta, Huan Zhang

Abstract: This paper presents AlphaOne ($α$1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. $α$1 first introduces $α$ moment, which represents the scaled thinking phase with a universal parameter $α$. Within this scaled pre-$α$ moment phase, it dynamically schedules slow thinking transitions by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process. After the $α$ moment, $α$1 deterministically terminates slow thinking with the end-of-thinking token, thereby fostering fast reasoning and efficient answer generation. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation. Extensive empirical studies on various challenging benchmarks across mathematical, coding, and scientific domains demonstrate $α$1's superior reasoning capability and efficiency. Project page: https://alphaone-project.github.io/

Submitted 30 May, 2025; originally announced May 2025.
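
Illustrative sketch for the AlphaOne entry above: the abstract describes Bernoulli-scheduled insertion of slow-thinking transition tokens before the $α$ moment and a deterministic end-of-thinking token afterwards. The snippet below is only a toy rendering of that description, not the authors' implementation. The token strings `"wait"` and `"</think>"`, the constant insertion probability `p_slow`, and the step-indexed budget are all assumptions made for illustration.

```python
import random


def schedule_thinking(num_steps, alpha_moment, p_slow, seed=0):
    """Return one control token per decoding step (None means no control token).

    Hypothetical toy model: steps before `alpha_moment` draw a Bernoulli(p_slow)
    variable to decide whether to insert a slow-thinking token; the step at
    `alpha_moment` always emits the end-of-thinking token.
    """
    rng = random.Random(seed)
    schedule = []
    for step in range(num_steps):
        if step < alpha_moment:
            # Pre-alpha phase: Bernoulli trial for a slow-thinking transition.
            schedule.append("wait" if rng.random() < p_slow else None)
        elif step == alpha_moment:
            # Alpha moment reached: deterministically end slow thinking.
            schedule.append("</think>")
        else:
            # Post-alpha phase: no further slow-thinking tokens are inserted.
            schedule.append(None)
    return schedule


if __name__ == "__main__":
    print(schedule_thinking(num_steps=12, alpha_moment=8, p_slow=0.3))
```

A constant `p_slow` is the simplest choice; the abstract says the transitions are scheduled dynamically, so a step-dependent probability would be the natural extension.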

arXiv:2505.19141 [pdf, ps, other]

S-unit equations in modules and linear-exponential Diophantine equations

Authors: Ruiwen Dong, Doron Shafrir

Abstract: Let $T$ be a positive integer, and $\mathcal{M}$ be a finitely presented module over the Laurent polynomial ring $\mathbb{Z}_{/T}[X_1^{\pm}, \ldots, X_N^{\pm}]$. We consider S-unit equations over $\mathcal{M}$: these are equations of the form $x_1 m_1 + \cdots + x_K m_K = m_0$, where the variables $x_1, \ldots, x_K$ range over the set of monomials (with coefficient 1) of $\mathbb{Z}_{/T}[X_1^{\pm}, \ldots, X_N^{\pm}]$. When $T$ is a power of a prime number $p$, we show that the solution set of an S-unit equation over $\mathcal{M}$ is effectively $p$-normal in the sense of Derksen and Masser (2015), generalizing their result on S-unit equations in fields of prime characteristic. When $T$ is an arbitrary positive integer, we show that deciding whether an S-unit equation over $\mathcal{M}$ admits a solution is Turing equivalent to solving a system of linear-exponential Diophantine equations, whose base contains the prime divisors of $T$. Combined with a recent result of Karimov, Luca, Nieuwveld, Ouaknine and Worrell (2025), this yields decidability when $T$ has at most two distinct prime divisors. This also shows that proving either decidability or undecidability in the case of arbitrary $T$ would entail major breakthroughs in number theory. We mention some potential applications of our results, such as deciding Submonoid Membership in wreath products of the form $\mathbb{Z}_{/p^a q^b} \wr \mathbb{Z}^d$, as well as progressing towards solving the Skolem problem in rings whose additive group is torsion. More connections in these directions will be explored in follow up papers.

Submitted 27 May, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

Comments: 80 pages, corrected spelling mistake for a name

arXiv:2504.11588 [pdf, other]

Deep Learning Approaches for Medical Imaging Under Varying Degrees of Label Availability: A Comprehensive Survey

Authors: Siteng Ma, Honghui Du, Yu An, Jing Wang, Qinqin Wang, Haochang Wu, Aonghus Lawlor, Ruihai Dong

Abstract: Deep learning has achieved significant breakthroughs in medical imaging, but these advancements are often dependent on large, well-annotated datasets. However, obtaining such datasets poses a significant challenge, as it requires time-consuming and labor-intensive annotations from medical experts. Consequently, there is growing interest in learning paradigms such as incomplete, inexact, and absent supervision, which are designed to operate under limited, inexact, or missing labels. This survey categorizes and reviews the evolving research in these areas, analyzing around 600 notable contributions since 2018. It covers tasks such as image classification, segmentation, and detection across various medical application areas, including but not limited to brain, chest, and cardiac imaging. We attempt to establish the relationships among existing research studies in related areas. We provide formal definitions of different learning paradigms and offer a comprehensive summary and interpretation of various learning mechanisms and strategies, aiding readers in better understanding the current research landscape and ideas. We also discuss potential future research challenges.

Submitted 15 April, 2025; originally announced April 2025.

Comments: 33 pages, 10 figures, 8 tables. To be submitted to Medical Image Analysis

MSC Class: 68T07; 68T45; 92C50; 92C55

ACM Class: I.2.10; I.4.5; I.4.6; I.4.9; J.3

arXiv:2504.07165 [pdf, other]

Perception in Reflection

Authors: Yana Wei, Liang Zhao, Kangheng Lin, En Yu, Yuang Peng, Runpei Dong, Jianjian Sun, Haoran Wei, Zheng Ge, Xiangyu Zhang, Vishal M. Patel

Abstract: We present a perception in reflection paradigm designed to transcend the limitations of current large vision-language models (LVLMs), which are expected yet often fail to achieve perfect perception initially. Specifically, we propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models, enabling iterative refinement of visual perception. This framework is powered by Reflective Perceptual Learning (RPL), which reinforces intrinsic reflective capabilities through a methodically constructed visual reflection dataset and reflective unlikelihood training. Comprehensive experimental evaluation demonstrates RePer's quantifiable improvements in image understanding, captioning precision, and hallucination reduction. Notably, RePer achieves strong alignment between model attention patterns and human visual focus, while RPL optimizes fine-grained and free-form preference alignment. These advancements establish perception in reflection as a robust paradigm for future multimodal agents, particularly in tasks requiring complex reasoning and multi-step manipulation.

Submitted 9 April, 2025; originally announced April 2025.

arXiv:2503.18948 [pdf, other]

Equivariant Image Modeling

Authors: Ruixiao Dong, Mengde Xu, Zigang Geng, Li Li, Han Hu, Shuyang Gu

Abstract: Current generative models, such as autoregressive and diffusion approaches, decompose high-dimensional data distribution learning into a series of simpler subtasks. However, inherent conflicts arise during the joint optimization of these subtasks, and existing solutions fail to resolve such conflicts without sacrificing efficiency or scalability. We propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks by leveraging the translation invariance of natural visual signals. Our method introduces (1) column-wise tokenization which enhances translational symmetry along the horizontal axis, and (2) windowed causal attention which enforces consistent contextual relationships across positions. Evaluated on class-conditioned ImageNet generation at 256x256 resolution, our approach achieves performance comparable to state-of-the-art AR models while using fewer computational resources. Systematic analysis demonstrates that enhanced equivariance reduces inter-task conflicts, significantly improving zero-shot generalization and enabling ultra-long image synthesis. This work establishes the first framework for task-aligned decomposition in generative modeling, offering insights into efficient parameter sharing and conflict-free optimization. The code and models are publicly available at https://github.com/drx-code/EquivariantModeling.

Submitted 24 March, 2025; originally announced March 2025.
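
Illustrative sketch for the Equivariant Image Modeling entry above: the abstract mentions windowed causal attention that keeps contextual relationships consistent across positions. One common way to realize such a constraint is a banded lower-triangular mask, shown below. This is a generic construction under the assumption that each position attends to itself and a fixed number of preceding positions; the function name and the `window` parameter are hypothetical, and the paper's exact definition may differ.

```python
def windowed_causal_mask(seq_len, window):
    """Build a boolean attention mask.

    mask[i][j] is True when query position i may attend to key position j,
    i.e. j is not in the future (j <= i) and lies within the last `window`
    positions ending at i (i - j < window).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Print the mask as rows of 1/0 for a quick visual check.
    for row in windowed_causal_mask(seq_len=6, window=3):
        print("".join("1" if allowed else "0" for allowed in row))
```

Once a row is at least `window` positions into the sequence, every row has the same pattern of allowed offsets, which is the sense in which the context is position-independent in this sketch.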

arXiv:2503.17893 [pdf, other]

Modeling Utilization to Identify Shared-Memory Atomic Bottlenecks

Authors: Rongcui Dong, Sreepathi Pai

Abstract: Performance analysis is critical for GPU programs with data-dependent behavior, but models like Roofline are not very useful for them and interpreting raw performance counters is tedious. In this work, we present an analytical model for shared memory atomics (fetch-and-op and compare-and-swap instructions on NVIDIA Volta and Ampere GPU) that allows users to immediately determine if shared memory atomic operations are a bottleneck for a program's execution. Our model is based on modeling the architecture as a single-server queuing model whose inputs are performance counters. It captures load-dependent behavior such as pipelining, parallelism, and different access patterns. We embody this model in a tool that uses CUDA hardware counters as parameters to predict the utilization of the shared-memory atomic unit. To the best of our knowledge, no existing profiling tool or model provides this capability for shared-memory atomic operations. We used the model to compare two histogram kernels that use shared-memory atomics. Although nearly identical, their performance can be different by up to 30%. Our tool correctly identifies a bottleneck shift from shared-memory atomic unit as the cause of this discrepancy.

Submitted 22 March, 2025; originally announced March 2025.

Comments: GPGPU 2025

arXiv:2503.10497 [pdf, other]

MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

Authors: Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Aosong Feng, Dairui Liu, Yun Xing, Junjue Wang, Fan Gao, Jinghui Lu, Yuang Jiang, Huitao Li, Xin Li, Kunyu Yu, Ruihai Dong, Shangding Gu, Yuekang Li, Xiaofei Xie, Felix Juefei-Xu, Foutse Khomh, Osamu Yoshie, Qingyu Chen, Douglas Teodoro, Nan Liu, et al. (7 additional authors not shown)

Abstract: Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-linguistic reasoning abilities. This dual limitation makes it challenging to comprehensively assess LLMs' performance in the multilingual setting. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on an English benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons. Additionally, to meet efficient evaluation needs, we provide a lite version containing 658 questions per language. To ensure the high quality of MMLU-ProX, we employ a rigorous development process that involves multiple powerful LLMs for translation, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance. Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized LLMs. The results reveal significant disparities in the multilingual capabilities of LLMs: While they perform well in high-resource languages, their performance declines markedly in low-resource languages, with gaps of up to 24.3%. Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.

Submitted 26 May, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

arXiv:2502.19158 [pdf, other]

When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning

Authors: Yijiang River Dong, Tiancheng Hu, Yinhong Liu, Ahmet Üstün, Nigel Collier

Abstract: While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority viewpoints. Although personalized preference learning addresses this by tailoring separate preferences for individual users, the field lacks standardized methods to assess its effectiveness. We present a multi-faceted evaluation framework that measures not only performance but also fairness, unintended effects, and adaptability across varying levels of preference divergence. Through extensive experiments comparing eight personalization methods across three preference datasets, we demonstrate that performance differences between methods could reach 36% when users strongly disagree, and personalization can introduce up to 20% safety misalignment. These findings highlight the critical need for holistic evaluation approaches to advance the development of more effective and inclusive preference learning systems.

Submitted 26 February, 2025; originally announced February 2025.

arXiv:2502.13143 [pdf, ps, other]

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Authors: Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi

Abstract: While spatial reasoning has made progress in object localization relationships, it often overlooks object orientation, a key factor in 6-DoF fine-grained manipulation. Traditional pose representations rely on pre-defined frames or templates, limiting generalization and semantic grounding. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a cup). To support this, we construct OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations, and develop PointSO, a general model for zero-shot semantic orientation prediction. By integrating semantic orientation into VLM agents, our SoFar framework enables 6-DoF spatial reasoning and generates robotic actions. Extensive experiments demonstrate the effectiveness and generalization of SoFar, e.g., a zero-shot 48.7% success rate on Open6DOR and a zero-shot 74.9% success rate on SIMPLER-Env.

Submitted 23 September, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

Comments: Accepted at NeurIPS 2025 Spotlight

arXiv:2502.12152 [pdf, other]

Learning Getting-Up Policies for Real-World Humanoid Robots

Authors: Xialin He, Runpei Dong, Zixuan Chen, Saurabh Gupta

Abstract: Automatic fall recovery is a crucial prerequisite before humanoid robots can be reliably deployed. Hand-designing controllers for getting up is difficult because of the varied configurations a humanoid can end up in after a fall and the challenging terrains humanoid robots are expected to operate on. This paper develops a learning framework to produce controllers that enable humanoid robots to get up from varying configurations on varying terrains. Unlike previous successful applications of learning to humanoid locomotion, the getting-up task involves complex contact patterns (which necessitate accurate modeling of the collision geometry) and sparser rewards. We address these challenges through a two-phase approach that induces a curriculum. The first stage focuses on discovering a good getting-up trajectory under minimal constraints on smoothness or speed / torque limits. The second stage then refines the discovered motions into deployable (i.e. smooth and slow) motions that are robust to variations in initial configuration and terrains. We find these innovations enable a real-world G1 humanoid robot to get up from two main situations that we considered: a) lying face up and b) lying face down, both tested on flat, deformable, slippery surfaces and slopes (e.g., sloppy grass and snowfield). This is one of the first successful demonstrations of learned getting-up policies for human-sized humanoid robots in the real world.

Submitted 27 April, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

Comments: Robotics: Science and Systems (RSS), 2025. Project page: https://humanoid-getup.github.io/

arXiv:2502.06221 [pdf, other]

Interaction-aware Conformal Prediction for Crowd Navigation

Authors: Zhe Huang, Tianchen Ji, Heling Zhang, Fatemeh Cheraghi Pouria, Katherine Driggs-Campbell, Roy Dong

Abstract: During crowd navigation, the robot motion plan needs to consider human motion uncertainty, and the human motion uncertainty is dependent on the robot motion plan. We introduce Interaction-aware Conformal Prediction (ICP) to alternate uncertainty-aware robot motion planning and decision-dependent human motion uncertainty quantification. ICP is composed of a trajectory predictor to predict human trajectories, a model predictive controller to plan robot motion with confidence interval radii added for probabilistic safety, a human simulator to collect a human trajectory calibration dataset conditioned on the planned robot motion, and a conformal prediction module to quantify trajectory prediction error on the decision-dependent calibration dataset. Crowd navigation simulation experiments show that ICP strikes a good balance of performance among navigation efficiency, social awareness, and uncertainty quantification compared to previous works. ICP generalizes well to navigation tasks under various crowd densities. The fast runtime and efficient memory usage make ICP practical for real-world applications. Code is available at https://github.com/tedhuang96/icp.

Submitted 10 February, 2025; originally announced February 2025.

Comments: Accepted by WAFR 2024
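
Illustrative sketch for the Interaction-aware Conformal Prediction entry above: the abstract describes a conformal prediction module that turns trajectory prediction errors on a calibration dataset into confidence interval radii. The snippet below shows the standard split-conformal quantile rule as one plausible way to compute such a radius. It is not taken from the linked repository; the function name, the `miscoverage` parameter, and the example error values are assumptions for illustration.

```python
import math


def conformal_radius(calibration_errors, miscoverage):
    """Return a radius that covers new errors with probability >= 1 - miscoverage.

    Standard split conformal prediction: take the ceil((n + 1) * (1 - miscoverage))-th
    smallest calibration error. If that rank exceeds n, no finite radius gives the
    requested guarantee, so infinity is returned.
    """
    n = len(calibration_errors)
    if n == 0:
        raise ValueError("calibration_errors must not be empty")
    rank = math.ceil((n + 1) * (1 - miscoverage))
    if rank > n:
        return math.inf
    return sorted(calibration_errors)[rank - 1]


if __name__ == "__main__":
    # Hypothetical displacement errors (in meters) between predicted and
    # simulated human positions at one prediction horizon.
    errors = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 1.0, 0.6]
    print(conformal_radius(errors, miscoverage=0.2))
```

In a pipeline like the one the abstract outlines, the calibration errors would be recollected whenever the planned robot motion changes, since the human trajectories depend on that plan.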

arXiv:2501.12389 [pdf, other]

Taming Teacher Forcing for Masked Autoregressive Video Generation

Authors: Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, Heung-Yeung Shum

Abstract: We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (namely Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues like exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.

Submitted 21 January, 2025; originally announced January 2025.

Comments: 12 pages, 9 figures

arXiv:2411.10130 [pdf, other]

Towards Multi-View Consistent Style Transfer with One-Step Diffusion via Vision Conditioning

Authors: Yushen Zuo, Jun Xiao, Kin-Chung Chan, Rongkang Dong, Cuixin Yang, Zongqi He, Hao Xie, Kin-Man Lam

Abstract: The stylization of 3D scenes is an increasingly attractive topic in 3D vision. Although image style transfer has been extensively researched with promising results, directly applying 2D style transfer methods to 3D scenes often fails to preserve the structural and multi-view properties of 3D environments, resulting in unpleasant distortions in images from different viewpoints. To address these issues, we leverage the remarkable generative prior of diffusion-based models and propose a novel style transfer method, OSDiffST, based on a pre-trained one-step diffusion model (i.e., SD-Turbo) for rendering diverse styles in multi-view images of 3D scenes. To efficiently adapt the pre-trained model for multi-view style transfer on small datasets, we introduce a vision condition module to extract style information from the reference style image to serve as conditional input for the diffusion model and employ LoRA in the diffusion model for adaptation. Additionally, we consider color distribution alignment and structural similarity between the stylized and content images using two specific loss functions. As a result, our method effectively preserves the structural information and multi-view consistency in stylized images without any 3D information. Experiments show that our method surpasses other promising style transfer methods in synthesizing various styles for multi-view images of 3D scenes. Stylized images from different viewpoints generated by our method achieve superior visual quality, with better structural integrity and less distortion. The source code is available at https://github.com/YushenZuo/OSDiffST.

Submitted 15 November, 2024; originally announced November 2024.

Comments: Accepted by ECCV 2024 AI for Visual Arts Workshop and Challenges, 18 pages, 7 figures

arXiv:2410.15311 [pdf, other]

Who is Undercover? Guiding LLMs to Explore Multi-Perspective Team Tactic in the Game

Authors: Ruiqi Dong, Zhixuan Liao, Guangwei Lai, Yuhan Ma, Danni Ma, Chenyou Fan

Abstract: Large Language Models (LLMs) are pivotal AI agents in complex tasks but still face challenges in open decision-making problems within complex scenarios. To address this, we use the language logic game "Who is Undercover?" (WIU) as an experimental platform to propose the Multi-Perspective Team Tactic (MPTT) framework. MPTT aims to cultivate LLMs' human-like language expression logic, multi-dimensional thinking, and self-perception in complex scenarios. By alternating speaking and voting sessions, integrating techniques like self-perspective, identity-determination, self-reflection, self-summary and multi-round find-teammates, LLM agents make rational decisions through strategic concealment and communication, fostering human-like trust. Preliminary results show that MPTT, combined with WIU, leverages LLMs' cognitive capabilities to create a decision-making framework that can simulate real society. This framework aids minority groups in communication and expression, promoting fairness and diversity in decision-making. Additionally, our Human-in-the-loop experiments demonstrate that LLMs can learn and align with human behaviors through interaction, indicating their potential for active participation in societal decision-making.

Submitted 20 October, 2024; originally announced October 2024.

arXiv:2410.13125 [pdf, other]

Transformers4NewsRec: A Transformer-based News Recommendation Framework

Authors: Dairui Liu, Honghui Du, Boming Yang, Neil Hurley, Aonghus Lawlor, Irene Li, Derek Greene, Ruihai Dong

Abstract: Pre-trained transformer models have shown great promise in various natural language processing tasks, including personalized news recommendations. To harness the power of these models, we introduce Transformers4NewsRec, a new Python framework built on the Transformers library. This framework is designed to unify and compare the performance of various news recommendation models, including deep neural networks and graph-based models. Transformers4NewsRec offers flexibility in terms of model selection, data preprocessing, and evaluation, allowing both quantitative and qualitative analysis.

Submitted 16 October, 2024; originally announced October 2024.

arXiv:2410.07952 [pdf, ps, other]

Eco-driving Incentive Mechanisms for Mitigating Emissions in Urban Transportation

Authors: M. Umar B. Niazi, Jung-Hoon Cho, Munther A. Dahleh, Roy Dong, Cathy Wu

Abstract: This paper proposes incentive mechanisms that promote eco-driving in transportation networks with the over-arching objective of minimizing emissions. The transportation system operator provides the drivers with energy-efficient driving guidance throughout their trips, and their eco-driving levels are measured by how closely they follow this guidance via vehicle telematics. Drivers choose their eco-driving levels to optimize a combination of their travel times and their emissions. To obtain optimal budget allocation and recommendations for the incentive mechanism, the system operator gathers drivers' preferences, or types, to assess each driver's trip urgency and natural willingness to eco-drive. In a setting where drivers truthfully report their types, we introduce the first-best incentive mechanism and show that the obedience condition holds (i.e., drivers find it optimal to comply with the system operator's recommendations) when the recommended eco-driving profile constitutes a Nash equilibrium. Moreover, in a setting where drivers can strategically report their types, we introduce the second-best incentive mechanism and show that the proposed mechanism is incentive-compatible (i.e., drivers find it optimal to be truthful). Under this mechanism, we also show that all equilibrium outcomes are at least as good as the recommended eco-driving profile in terms of the system operator's objective.
Overall, this work offers a framework for designing eco-driving incentive mechanisms while considering both the strategic behavior of individual drivers and the network effects of collective decision-making.

Submitted 10 October, 2024; originally announced October 2024.

arXiv:2410.07216 [pdf, other]

Evaluating Financial Relational Graphs: Interpretation Before Prediction

Authors: Yingjie Niu, Lanxin Lu, Rian Dolphin, Valerio Poti, Ruihai Dong

Abstract: Accurate and robust stock trend forecasting has been a crucial and challenging task, as stock price changes are influenced by multiple factors. Graph neural network-based methods have recently achieved remarkable success in this domain by constructing stock relationship graphs that reflect internal factors and relationships between stocks. However, most of these methods rely on predefined factors to construct static stock relationship graphs due to the lack of suitable datasets, failing to capture the dynamic changes in stock relationships. Moreover, the evaluation of relationship graphs in these methods is often tied to the performance of neural network models on downstream tasks, leading to confusion and imprecision. To address these issues, we introduce the SPNews dataset, collected based on S&P 500 Index stocks, to facilitate the construction of dynamic relationship graphs. Furthermore, we propose a novel set of financial relationship graph evaluation methods that are independent of downstream tasks. By using the relationship graph to explain historical financial phenomena, we assess its validity before constructing a graph neural network, ensuring the graph's effectiveness in capturing relevant financial relationships. Experimental results demonstrate that our evaluation methods can effectively differentiate between various financial relationship graphs, yielding more interpretable results compared to traditional approaches. We make our source code publicly available on GitHub to promote reproducibility and further research in this area.

Submitted 28 September, 2024; originally announced October 2024.

Comments: Accepted by 2024 ACM International Conference on AI in Finance

ACM Class: I.2.4

arXiv:2410.04905 [pdf, ps, other]

Equations in wreath products

Authors: Laurent Bartholdi, Ruiwen Dong, Leon Pernak, Jan Philipp Wächter

Abstract: We survey solvability of equations in wreath products of groups, and prove that the quadratic diophantine problem is solvable in wreath products of Abelian groups. We consider the related question of determining commutator width, and prove that the quadratic diophantine problem is also solvable in Baumslag's finitely presented metabelian group. This text is a short version of an extensive article by the first-named authors.

Submitted 7 October, 2024; originally announced October 2024.

arXiv:2409.17228 [pdf, other]

Disk2Planet: A Robust and Automated Machine Learning Tool for Parameter Inference in Disk-Planet Systems

Authors: Shunyuan Mao, Ruobing Dong, Kwang Moo Yi, Lu Lu, Sifan Wang, Paris Perdikaris

Abstract: We introduce Disk2Planet, a machine learning-based tool to infer key parameters in disk-planet systems from observed protoplanetary disk structures. Disk2Planet takes as input the disk structures in the form of two-dimensional density and velocity maps, and outputs disk and planet properties, that is, the Shakura–Sunyaev viscosity, the disk aspect ratio, the planet–star mass ratio, and the planet's radius and azimuth. We integrate the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), an evolutionary algorithm tailored for complex optimization problems, and the Protoplanetary Disk Operator Network (PPDONet), a neural network designed to predict solutions of disk–planet interactions. Our tool is fully automated and can retrieve parameters in one system in three minutes on an Nvidia A100 graphics processing unit. We empirically demonstrate that our tool achieves percent-level or higher accuracy, and is able to handle missing data and unknown levels of noise.

Submitted 25 September, 2024; originally announced September 2024.

Comments: Accepted to ApJ

arXiv:2409.15045 [pdf, other]

AIM 2024 Sparse Neural Rendering Challenge: Methods and Results

Authors: Michal Nazarczuk, Sibi Catley-Chandar, Thomas Tanay, Richard Shaw, Eduardo Pérez-Pellitero, Radu Timofte, Xing Yan, Pan Wang, Yali Guo, Yongxin Wu, Youcheng Cai, Yanan Yang, Junting Li, Yanghong Zhou, P. Y. Mok, Zongqi He, Zhe Xiao, Kin-Chung Chan, Hana Lebeta Goshu, Cuixin Yang, Rongkang Dong, Jun Xiao, Kin-Man Lam, Jiayao Hao, Qiong Gao, et al. (5 additional authors not shown)

Abstract: This paper reviews the challenge on Sparse Neural Rendering that was part of the Advances in Image Manipulation (AIM) workshop, held in conjunction with ECCV 2024. This manuscript focuses on the competition set-up, the proposed methods and their respective results. The challenge aims at producing novel camera view synthesis of diverse scenes from sparse image observations. It is composed of two tracks, with differing levels of sparsity; 3 views in Track 1 (very sparse) and 9 views in Track 2 (sparse). Participants are asked to optimise objective fidelity to the ground-truth images as measured via the Peak Signal-to-Noise Ratio (PSNR) metric. For both tracks, we use the newly introduced Sparse Rendering (SpaRe) dataset and the popular DTU MVS dataset. In this challenge, 5 teams submitted final results to Track 1 and 4 teams submitted final results to Track 2. The submitted models are varied and push the boundaries of the current state-of-the-art in sparse neural rendering. A detailed description of all models developed in the challenge is provided in this paper.

Submitted 23 September, 2024; originally announced September 2024.

Comments: Part of Advances in Image Manipulation workshop at ECCV 2024

arXiv:2409.13259 [pdf, other]

A generalizable framework for unlocking missing reactions in genome-scale metabolic networks using deep learning

Authors: Xiaoyi Liu, Hongpeng Yang, Chengwei Ai, Ruihan Dong, Yijie Ding, Qianqian Yuan, Jijun Tang, Fei Guo

Abstract: Incomplete knowledge of metabolic processes hinders the accuracy of GEnome-scale Metabolic models (GEMs), which in turn impedes advancements in systems biology and metabolic engineering. Existing gap-filling methods typically rely on phenotypic data to minimize the disparity between computational predictions and experimental results. However, there is still a lack of an automatic and precise gap-filling method for initial state GEMs before experimental data and annotated genomes become available. In this study, we introduce CLOSEgaps, a deep learning-driven tool that addresses the gap-filling issue by modeling it as a hyperedge prediction problem within GEMs. Specifically, CLOSEgaps maps metabolic networks as hypergraphs and learns their hyper-topology features to identify missing reactions and gaps by leveraging hypothetical reactions. This innovative approach allows for the characterization and curation of both known and hypothetical reactions within metabolic networks. Extensive results demonstrate that CLOSEgaps accurately fills over 96% of artificially introduced gaps for various GEMs. Furthermore, CLOSEgaps enhances phenotypic predictions for 24 GEMs and also finds a notable improvement in producing four crucial metabolites (Lactate, Ethanol, Propionate, and Succinate) in two organisms. As a broadly applicable solution for any GEM, CLOSEgaps represents a promising model to automate the gap-filling process and uncover missing connections between reactions and observed metabolic phenotypes.

Submitted 20 September, 2024; originally announced September 2024.

arXiv:2409.12396 [pdf, other]

ARTAI: An Evaluation Platform to Assess Societal Risk of Recommender Algorithms

Authors: Qin Ruan, Jin Xu, Ruihai Dong, Arjumand Younus, Tai Tan Mai, Barry O'Sullivan, Susan Leavy

Abstract: Societal risk emanating from how recommender algorithms disseminate content online is now well documented. Emergent regulation aims to mitigate this risk through ethical audits and enabling new research on the social impact of algorithms. However, there is currently a need for tools and methods that enable such evaluation. This paper presents ARTAI, an evaluation environment that enables large-scale assessments of recommender algorithms to identify harmful patterns in how content is distributed online and enables the implementation of new regulatory requirements for increased transparency in recommender systems.

Submitted 18 September, 2024; originally announced September 2024.

Comments: 3 pages, 1 figure, accepted at FAccTRec 2024 Workshop, RecSys 2024

ACM Class: H.3.3; I.2.7; I.5.1

arXiv:2409.07077 [pdf, other]

Submonoid Membership in n-dimensional lamplighter groups and S-unit equations

Authors: Ruiwen Dong

Abstract: We show that Submonoid Membership is decidable in n-dimensional lamplighter groups $(\mathbb{Z}/p\mathbb{Z}) \wr \mathbb{Z}^n$ for any prime $p$ and integer $n$. More generally, we show decidability of Submonoid Membership in semidirect products of the form $\mathcal{Y} \rtimes \mathbb{Z}^n$, where $\mathcal{Y}$ is any finitely presented module over the Laurent polynomial ring $\mathbb{F}_p[X_1^{\pm}, \ldots, X_n^{\pm}]$. Combined with a result of Shafrir (2024), this gives the first example of a group $G$ and a finite index subgroup $\widetilde{G} \leq G$, such that Submonoid Membership is decidable in $\widetilde{G}$ but undecidable in $G$. To obtain our decidability result, we reduce Submonoid Membership in $\mathcal{Y} \rtimes \mathbb{Z}^n$ to solving S-unit equations over $\mathbb{F}_p[X_1^{\pm}, \ldots, X_n^{\pm}]$-modules. We show that the solution set of such equations is effectively $p$-automatic, extending a result of Adamczewski and Bell (2012). As an intermediate result, we also obtain that the solution set of the Knapsack Problem in $\mathcal{Y} \rtimes \mathbb{Z}^n$ is effectively $p$-automatic.

Submitted 27 May, 2025; v1 submitted 11 September, 2024; originally announced September 2024.

Comments: Full version of conference paper at ICALP'25

arXiv:2408.11567 [pdf, ps, other]

Positional Prompt Tuning for Efficient 3D Representation Learning

Authors: Shaochen Zhang, Zekun Qi, Runpei Dong, Xiuxiu Bai, Xing Wei

Abstract: We rethink the role of positional encoding in 3D representation learning and fine-tuning. We argue that using positional encoding in point Transformer-based methods serves to aggregate multi-scale features of point clouds. Additionally, we explore parameter-efficient fine-tuning (PEFT) through the lens of prompts and adapters, introducing a straightforward yet effective method called PPT for point cloud analysis. PPT incorporates increased patch tokens and trainable positional encoding while keeping most pre-trained model parameters frozen. Extensive experiments validate that PPT is both effective and efficient. Our proposed method of PEFT tasks, namely PPT, with only 1.05M of parameters for training, gets state-of-the-art results in several mainstream datasets, such as 95.01% accuracy in the ScanObjectNN OBJ_BG dataset. Codes and weights will be released at https://github.com/zsc000722/PPT.

Submitted 23 September, 2025; v1 submitted 21 August, 2024; originally announced August 2024.

Comments: Accepted at ACMMM 2025 Oral

arXiv:2408.09460 [pdf, other]

Fine-Grained Building Function Recognition from Street-View Images via Geometry-Aware Semi-Supervised Learning

Authors: Weijia Li, Jinhua Yu, Dairong Chen, Yi Lin, Runmin Dong, Xiang Zhang, Conghui He, Haohuan Fu

Abstract: In this work, we propose a geometry-aware semi-supervised framework for fine-grained building function recognition, utilizing geometric relationships among multi-source data to enhance pseudo-label accuracy in semi-supervised learning, broadening its applicability to various building function categorization systems. Firstly, we design an online semi-supervised pre-training stage, which facilitates the precise acquisition of building facade location information in street-view images. In the second stage, we propose a geometry-aware coarse annotation generation module. This module effectively combines GIS data and street-view data based on the geometric relationships, improving the accuracy of pseudo annotations. In the third stage, we combine the newly generated coarse annotations with the existing labeled dataset to achieve fine-grained functional recognition of buildings across multiple cities at a large scale. Extensive experiments demonstrate that our proposed framework exhibits superior performance in fine-grained functional recognition of buildings. Within the same categorization system, it achieves improvements of 7.6% and 4.8% compared to fully-supervised methods and state-of-the-art semi-supervised methods, respectively. Additionally, our method also performs well in cross-city scenarios, i.e., extending the model trained on OmniCity (New York) to new cities (i.e., Los Angeles and Boston) with different building function categorization systems.
This study offers a new solution for large-scale multi-city applications with minimal annotation requirements, facilitating more efficient data updates and resource allocation in urban management.

Submitted 8 September, 2024; v1 submitted 18 August, 2024; originally announced August 2024.

Comments: This paper is currently under review

arXiv:2408.07527 [pdf, other]

Evidential Graph Contrastive Alignment for Source-Free Blending-Target Domain Adaptation

Authors: Juepeng Zheng, Yibin Wen, Jinxiao Zhang, Runmin Dong, Haohuan Fu

Abstract: In this paper, we first tackle a more realistic Domain Adaptation (DA) setting: Source-Free Blending-Target Domain Adaptation (SF-BTDA), where we cannot access source domain data while facing mixed multiple target domains without any prior domain labels. Compared to existing DA scenarios, SF-BTDA generally faces the co-existence of different label shifts in different targets, along with noisy target pseudo labels generated from the source model. In this paper, we propose a new method called Evidential Contrastive Alignment (ECA) to decouple the blending target domain and alleviate the effect of noisy target pseudo labels. First, to improve the quality of pseudo target labels, we propose a calibrated evidential learning module to iteratively improve both the accuracy and certainty of the resulting model and adaptively generate high-quality pseudo target labels. Second, we design a graph contrastive learning with the domain distance matrix and confidence-uncertainty criterion, to minimize the distribution gap of samples of the same class in the blended target domains, which alleviates the co-existence of different label shifts in blended targets. We conduct a new benchmark based on three standard DA datasets and ECA outperforms other methods with considerable gains and achieves comparable results compared with those that have prior domain labels or source data.

Submitted 25 August, 2024; v1 submitted 14 August, 2024; originally announced August 2024.

arXiv:2407.18645 [pdf, other]

Contrastive Learning of Asset Embeddings from Financial Time Series

Authors: Rian Dolphin, Barry Smyth, Ruihai Dong

Abstract: Representation learning has emerged as a powerful paradigm for extracting valuable latent features from complex, high-dimensional data. In financial domains, learning informative representations for assets can be used for tasks like sector classification, and risk management. However, the complex and stochastic nature of financial markets poses unique challenges. We propose a novel contrastive learning framework to generate asset embeddings from financial time series data. Our approach leverages the similarity of asset returns over many subwindows to generate informative positive and negative samples, using a statistical sampling strategy based on hypothesis testing to address the noisy nature of financial data. We explore various contrastive loss functions that capture the relationships between assets in different ways to learn a discriminative representation space. Experiments on real-world datasets demonstrate the effectiveness of the learned asset embeddings on benchmark industry classification and portfolio optimization tasks. In each case our novel approaches significantly outperform existing baselines highlighting the potential for contrastive learning to capture meaningful and actionable relationships in financial data.

Submitted 26 July, 2024; originally announced July 2024.

Comments: 9 pages, 4 figures, 4 tables

arXiv:2407.18472 [pdf, other]

doi 10.1145/3626772.3657941

FedUD: Exploiting Unaligned Data for Cross-Platform Federated Click-Through Rate Prediction

Authors: Wentao Ouyang, Rui Dong, Ri Tao, Xiangzheng Liu

Abstract: Click-through rate (CTR) prediction plays an important role in online advertising platforms. Most existing methods use data from the advertising platform itself for CTR prediction. As user behaviors also exist on many other platforms, e.g., media platforms, it is beneficial to further exploit such complementary information for better modeling user interest and for improving CTR prediction performance. However, due to privacy concerns, data from different platforms cannot be uploaded to a server for centralized model training. Vertical federated learning (VFL) provides a possible solution which is able to keep the raw data on respective participating parties and learn a collaborative model in a privacy-preserving way. However, traditional VFL methods only utilize aligned data with common keys across parties, which strongly restricts their application scope. In this paper, we propose FedUD, which is able to exploit unaligned data, in addition to aligned data, for more accurate federated CTR prediction. FedUD contains two steps. In the first step, FedUD utilizes aligned data across parties like traditional VFL, but it additionally includes a knowledge distillation module. This module distills useful knowledge from the guest party's high-level representations and guides the learning of a representation transfer network. In the second step, FedUD applies the learned knowledge to enrich the representations of the host party's unaligned data such that both aligned and unaligned data can contribute to federated model training.
Experiments on two real-world datasets demonstrate the superior performance of FedUD for federated CTR prediction.

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.05352 [pdf, other]

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Authors: Danni Yang, Ruohan Dong, Jiayi Ji, Yiwei Ma, Haowei Wang, Xiaoshuai Sun, Rongrong Ji

Abstract: Recently, diffusion models have increasingly demonstrated their capabilities in vision understanding. By leveraging prompt-based learning to construct sentences, these models have shown proficiency in classification and visual grounding tasks. However, existing approaches primarily showcase their ability to perform sentence-level localization, leaving the potential for leveraging contextual information for phrase-level understanding largely unexplored. In this paper, we utilize Panoptic Narrative Grounding (PNG) as a proxy task to investigate this capability further. PNG aims to segment object instances mentioned by multiple noun phrases within a given narrative text. Specifically, we introduce the DiffPNG framework, a straightforward yet effective approach that fully capitalizes on the diffusion's architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps. The framework initially identifies anchor points using cross-attention mechanisms and subsequently performs segmentation with self-attention to achieve zero-shot PNG. Moreover, we introduce a refinement module based on SAM to enhance the quality of the segmentation masks. Our extensive experiments on the PNG dataset demonstrate that DiffPNG achieves strong performance in the zero-shot PNG task setting, conclusively proving the diffusion model's capability for context-aware, phrase-level understanding. Source code is available at https://github.com/nini0919/DiffPNG.

Submitted 7 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV2024

arXiv:2406.16855 [pdf, other]

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Authors: Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, Shu-Tao Xia

Abstract: Personalized image generation holds great promise in assisting humans in everyday work and life due to its impressive ability to creatively generate personalized content across various contexts. However, current evaluations either are automated but misalign with humans or require human evaluations that are time-consuming and expensive. In this work, we present DreamBench++, a human-aligned benchmark that advanced multimodal GPT models automate. Specifically, we systematically design the prompts to let GPT be both human-aligned and self-aligned, empowered with task reinforcement. Further, we construct a comprehensive dataset comprising diverse images and prompts. By benchmarking 7 modern generative models, we demonstrate that DreamBench++ results in significantly more human-aligned evaluation, helping boost the community with innovative findings.

Submitted 8 March, 2025; v1 submitted 24 June, 2024; originally announced June 2024.

Comments: ICLR 2025, Project page: https://dreambenchplus.github.io/

arXiv:2406.16439 [pdf, ps, other]

Exploring Test-Time Adaptation for Object Detection in Continually Changing Environments

Authors: Shilei Cao, Juepeng Zheng, Yan Liu, Baoquan Zhao, Ziqi Yuan, Weijia Li, Runmin Dong, Haohuan Fu

Abstract: Real-world application models are commonly deployed in dynamic environments, where the target domain distribution undergoes temporal changes. Continual Test-Time Adaptation (CTTA) has recently emerged as a promising technique to gradually adapt a source-trained model to continually changing target domains. Despite recent advancements in addressing CTTA, two critical issues remain: 1) Fixed thresholds for pseudo-labeling in existing methodologies lead to low-quality pseudo-labels, as model confidence varies across categories and domains; 2) Stochastic parameter restoration methods for mitigating catastrophic forgetting fail to preserve critical information effectively, due to their intrinsic randomness. To tackle these challenges for detection models in CTTA scenarios, we present AMROD, featuring three core components. Firstly, the object-level contrastive learning module extracts object-level features for contrastive learning to refine the feature representation in the target domain. Secondly, the adaptive monitoring module dynamically skips unnecessary adaptation and updates the category-specific threshold based on predicted confidence scores to enable efficiency and improve the quality of pseudo-labels. Lastly, the adaptive randomized restoration mechanism selectively resets inactive parameters with higher possibilities, ensuring the retention of essential knowledge.
We demonstrate the effectiveness of AMROD on four CTTA object detection tasks, where AMROD outperforms existing methods, especially achieving a 3.2 mAP improvement and a 20% increase in efficiency on the Cityscapes-to-Cityscapes-C CTTA task. The code of this work is available at https://github.com/ShileiCao/AMROD.

Submitted 11 June, 2025; v1 submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.11657 [pdf, other]

Can LLM be a Personalized Judge?

Authors: Yijiang River Dong, Tiancheng Hu, Nigel Collier

Abstract: Ensuring that large language models (LLMs) reflect diverse user values and preferences is crucial as their user bases expand globally. It is therefore encouraging to see the growing interest in LLM personalization within the research community. However, current works often rely on the LLM-as-a-Judge approach for evaluation without thoroughly examining its validity. In this paper, we investigate the reliability of LLM-as-a-Personalized-Judge, asking LLMs to judge user preferences based on personas. Our findings suggest that directly applying LLM-as-a-Personalized-Judge is less reliable than previously assumed, showing low and inconsistent agreement with human ground truth. The personas typically used are often overly simplistic, resulting in low predictive power. To address these issues, we introduce verbal uncertainty estimation into the LLM-as-a-Personalized-Judge pipeline, allowing the model to express low confidence on uncertain judgments. This adjustment leads to much higher agreement (above 80%) on high-certainty samples for binary tasks. Through human evaluation, we find that the LLM-as-a-Personalized-Judge achieves comparable performance to third-party human evaluation and even surpasses human performance on high-certainty samples. Our work indicates that certainty-enhanced LLM-as-a-Personalized-Judge offers a promising direction for developing more reliable and scalable methods for evaluating LLM personalization.

Submitted 17 June, 2024; originally announced June 2024.

Comments: Our code is available at https://github.com/dong-river/Personalized-Judge

arXiv:2406.10869 [pdf, other]

Geometric Distortion Guided Transformer for Omnidirectional Image Super-Resolution

Authors: Cuixin Yang, Rongkang Dong, Jun Xiao, Cong Zhang, Kin-Man Lam, Fei Zhou, Guoping Qiu

Abstract: As virtual and augmented reality applications gain popularity, omnidirectional image (ODI) super-resolution has become increasingly important. Unlike 2D plain images that are formed on a plane, ODIs are projected onto spherical surfaces. Applying established image super-resolution methods to ODIs, therefore, requires performing equirectangular projection (ERP) to map the ODIs onto a plane. ODI super-resolution needs to take into account geometric distortion resulting from ERP. However, without considering such geometric distortion of ERP images, previous deep-learning-based methods only utilize a limited range of pixels and may easily miss self-similar textures for reconstruction. In this paper, we introduce a novel Geometric Distortion Guided Transformer for Omnidirectional image Super-Resolution (GDGT-OSR). Specifically, a distortion modulated rectangle-window self-attention mechanism, integrated with deformable self-attention, is proposed to better perceive the distortion and thus involve more self-similar textures. Distortion modulation is achieved through a newly devised distortion guidance generator that produces guidance by exploiting the variability of distortion across latitudes. Furthermore, we propose a dynamic feature aggregation scheme to adaptively fuse the features from different self-attention modules. We present extensive experimental results on public datasets and show that the new GDGT-OSR outperforms methods in existing literature.

Submitted 16 January, 2025; v1 submitted 16 June, 2024; originally announced June 2024.

Comments: 13 pages, 12 figures, journal

arXiv:2406.08480 [pdf, ps, other]

Linear equations with monomial constraints and decision problems in abelian-by-cyclic groups

Authors: Ruiwen Dong

Abstract: We show that it is undecidable whether a system of linear equations over the Laurent polynomial ring $\mathbb{Z}[X^{\pm}]$ admits solutions where a specified subset of variables take values in the set of monomials $\{X^z \mid z \in \mathbb{Z}\}$. In particular, we construct a finitely presented $\mathbb{Z}[X^{\pm}]$-module, where it is undecidable whether a linear equation $X^{z_1} \boldsymbol{f}_1 + \cdots + X^{z_n} \boldsymbol{f}_n = \boldsymbol{f}_0$ has solutions $z_1, \ldots, z_n \in \mathbb{Z}$. This contrasts with the decidability of the case $n = 1$, which can be deduced from Noskov's Lemma. We apply this result to settle a number of problems in computational group theory. We show that it is undecidable whether a system of equations has solutions in the wreath product $\mathbb{Z} \wr \mathbb{Z}$, providing a negative answer to an open problem of Kharlampovich, López and Miasnikov (2020). We show that there exists a finitely generated abelian-by-cyclic group in which the problem of solving a single quadratic equation is undecidable. We also construct a finitely generated abelian-by-cyclic group, different to that of Mishchenko and Treier (2017), in which the Knapsack Problem is undecidable. In contrast, we show that the problem of Coset Intersection is decidable in all finitely generated abelian-by-cyclic groups.

Submitted 6 September, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: Corrected an error in Lemma 6.8. Supersedes arXiv:2309.08811

Showing 1–50 of 178 results for author: Dong, R