Search | arXiv e-print repository

arXiv:2506.20650 [pdf, ps, other]

Mastering Multiple-Expert Routing: Realizable $H$-Consistency and Strong Guarantees for Learning to Defer

Authors: Anqi Mao, Mehryar Mohri, Yutao Zhong

Abstract: The problem of learning to defer with multiple experts consists of optimally assigning input instances to experts, balancing the trade-off between their accuracy and computational cost. This is a critical challenge in natural language generation, but also in other fields such as image processing, and medical diagnostics. Recent studies have proposed surrogate loss functions to optimize deferral, but challenges remain in ensuring their consistency properties. This paper introduces novel surrogate loss functions and efficient algorithms with strong theoretical learning guarantees. We address open questions regarding realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for both single-stage (jointly learning predictor and deferral function) and two-stage (learning only the deferral function with a fixed expert) learning scenarios. For single-stage deferral, we introduce a family of new realizable $H$-consistent surrogate losses and further prove $H$-consistency for a selected member. For two-stage deferral, we derive new surrogate losses that achieve realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for the two-expert scenario and, under natural assumptions, multiple-expert scenario. Additionally, we provide enhanced theoretical guarantees under low-noise assumptions for both scenarios. Finally, we report the results of experiments using our proposed surrogate losses, comparing their performance against existing baselines.

Submitted 25 June, 2025; originally announced June 2025.

Comments: ICML 2025

arXiv:2506.18968 [pdf, ps, other]

doi 10.1093/mnras/staf1000

A new window into the sub-parsec scale magnetic field in the Milky Way? Unveiling small-scale magneto-ionic structures with Faraday complexity

Authors: Yik Ki Ma, Amit Seta, N. M. McClure-Griffiths, C. L. Van Eck, S. A. Mao, A. Ordog, J. C. Brown, T. O. Kovacs, Takuya Akahori, K. Kurahara, L. Oberhelman, C. S. Anderson

Abstract: Radio broadband spectro-polarimetric observations are sensitive to the spatial fluctuations of the Faraday depth (FD) within the telescope beam. Such FD fluctuations are referred to as "Faraday complexity", and can unveil small-scale magneto-ionic structures in both the synchrotron-emitting and the foreground volumes. We explore the astrophysical origin of the Faraday complexity exhibited by 191 polarised extragalactic radio sources (EGSs) within 5 deg from the Galactic plane in the longitude range of 20-52 deg, using broadband data from the Karl G. Jansky Very Large Array presented by a previous work. A new parameter called the FD spread is devised to quantify the spatial FD fluctuations. We find that the FD spread of the EGSs (i) demonstrates an enhancement near the Galactic mid-plane, most notable within Galactic latitude of ±3 deg, (ii) exhibits hints of modulations across Galactic longitude, (iii) does not vary with the source size across the entire range of 2.5"-300", and (iv) has an amplitude higher than expected from magneto-ionic structures of extragalactic origin. All these suggest that the primary cause of the Faraday complexity exhibited by our target EGSs is <2.5"-scale magneto-ionic structures in the Milky Way. We argue that the anisotropic turbulent magnetic field generated by galactic-scale shocks and shears, or the stellar feedback-driven isotropic turbulent magnetic field, are the most likely candidates.
Our work highlights the use of broadband radio polarimetric observations of EGSs as a powerful probe of multi-scale magnetic structures in the Milky Way.

Submitted 23 June, 2025; originally announced June 2025.

Comments: 32 pages, 19 figures, MNRAS accepted

arXiv:2506.08260 [pdf, ps, other]

Automatic Generation of Inference Making Questions for Reading Comprehension Assessments

Authors: Wanjing Anya Ma, Michael Flor, Zuowei Wang

Abstract: Inference making is an essential but complex skill in reading comprehension (RC). Some inferences require resolving references across sentences, and some rely on using prior knowledge to fill in the detail that is not explicitly written in the text. Diagnostic RC questions can help educators provide more effective and targeted reading instruction and interventions for school-age students. We introduce a taxonomy of inference types for RC and use it to analyze the distribution of items within a diagnostic RC item bank. Next, we present experiments using GPT-4o to generate bridging-inference RC items for given reading passages via few-shot prompting, comparing conditions with and without chain-of-thought prompts. Generated items were evaluated on three aspects: overall item quality, appropriate inference type, and LLM reasoning, achieving high inter-rater agreements above 0.90. Our results show that GPT-4o produced 93.8% good-quality questions suitable for operational use in grade 3-12 contexts; however, only 42.6% of the generated questions accurately matched the targeted inference type. We conclude that combining automatic item generation with human judgment offers a promising path toward scalable, high-quality diagnostic RC assessments.

Submitted 9 June, 2025; originally announced June 2025.

Comments: Accepted to the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), co-located with the ACL 2025

arXiv:2506.07048 [pdf, ps, other]

Dimensionless Hierarchical Topological Phononic States

Authors: Joel R. Pyfrom, Kai Sun, Jihong A. Ma

Abstract: Topological insulators exhibit unique boundary states that are protected by the topology of the bulk bands, a phenomenon that has now been extended to classical systems such as phononics and mechanics. Typically, nontrivial topology in an $n$-dimensional bulk leads to the emergence of $(n-1)$-dimensional topologically protected boundary states. However, these states can often be gapped out by breaking the symmetry that protects them, resulting in the possible creation of new in-gap higher-order topological modes. A notable example of this is the higher-order topological insulator (HOTI), where gapping out surface states leads to the formation of lower-dimensional topological modes, such as hinge or corner states. This process reduces the spatial dimensionality of the protected modes from $(n-1)$ to $(n-2)$ or even lower. In this work, we propose an alternative method to achieve higher-order topological modes using a one-dimensional Su-Schrieffer-Heeger model. Instead of relying on dimensional reduction, we manipulate the positions of domain walls to gap out the originally topologically protected domain-wall states, thereby inducing new higher-order topological states. These new higher-order topological states can be characterized using a generalized winding number calculation. This approach allows for the realization of multiple (and even infinite) topological orders within simple 1D lattices while maintaining the principle of bulk-boundary correspondence.
Our study reveals a new mechanism that enriches topological hierarchies beyond conventional classifications. Such a mechanism could also be extended to higher dimensions, potentially creating intricate networks of topological states and advancing our control over wave phenomena.

Submitted 8 June, 2025; originally announced June 2025.

Comments: 15 pages, 9 figures

arXiv:2505.20692 [pdf, ps, other]

Can we Debias Social Stereotypes in AI-Generated Images? Examining Text-to-Image Outputs and User Perceptions

Authors: Saharsh Barve, Andy Mao, Jiayue Melissa Shi, Prerna Juneja, Koustuv Saha

Abstract: Recent advances in generative AI have enabled visual content creation through text-to-image (T2I) generation. However, despite their creative potential, T2I models often replicate and amplify societal stereotypes -- particularly those related to gender, race, and culture -- raising important ethical concerns. This paper proposes a theory-driven bias detection rubric and a Social Stereotype Index (SSI) to systematically evaluate social biases in T2I outputs. We audited three major T2I model outputs -- DALL-E-3, Midjourney-6.1, and Stability AI Core -- using 100 queries across three categories -- geocultural, occupational, and adjectival. Our analysis reveals that initial outputs are prone to include stereotypical visual cues, including gendered professions, cultural markers, and western beauty norms. To address this, we adopted our rubric to conduct targeted prompt refinement using LLMs, which significantly reduced bias -- SSI dropped by 61% for geocultural, 69% for occupational, and 51% for adjectival queries. We complemented our quantitative analysis through a user study examining perceptions, awareness, and preferences around AI-generated biased imagery. Our findings reveal a key tension -- although prompt refinement can mitigate stereotypes, it can limit contextual alignment. Interestingly, users often perceived stereotypical images to be more aligned with their expectations.
We discuss the need to balance ethical debiasing with contextual relevance and call for T2I systems that support global diversity and inclusivity while not compromising the reflection of real-world social complexity.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.08272 [pdf, ps, other]

The Polarisation Sky Survey of the Universe's Magnetism (POSSUM): Science Goals and Survey Description

Authors: B. M. Gaensler, G. H. Heald, N. M. McClure-Griffiths, C. S. Anderson, C. L. Van Eck, J. L. West, A. J. M. Thomson, J. P. Leahy, L. Rudnick, Y. K. Ma, Takuya Akahori, G. Gürkan, T. L. Landecker, S. A. Mao, S. P. O'Sullivan, W. Raja, X. Sun, T. Vernstrom, Lerato Baidoo, Ettore Carretti, A. R. Taylor, A. G. Willis, Erik Osinga, J. D. Livingston, E. L. Alexander, et al. (35 additional authors not shown)

Abstract: The Australian SKA Pathfinder (ASKAP) offers powerful new capabilities for studying the polarised and magnetised Universe at radio wavelengths. In this paper, we introduce the Polarisation Sky Survey of the Universe's Magnetism (POSSUM), a groundbreaking survey with three primary objectives: (1) to create a comprehensive Faraday rotation measure (RM) grid of up to one million compact extragalactic sources across the southern ~50 per cent of the sky (20,630 deg$^2$); (2) to map the intrinsic polarisation and RM properties of a wide range of discrete extragalactic and Galactic objects over the same area; and (3) to contribute interferometric data with excellent surface brightness sensitivity, which can be combined with single-dish data to study the diffuse Galactic interstellar medium. Observations for the full POSSUM survey commenced in May 2023 and are expected to conclude by mid-2028. POSSUM will achieve an RM grid density of around 30-50 RMs per square degree with a median measurement uncertainty of ~1 rad m$^{-2}$. The survey operates primarily over a frequency range of 800-1088 MHz, with an angular resolution of 20'' and a typical RMS sensitivity in Stokes $Q$ or $U$ of 18 $μ$Jy beam$^{-1}$. Additionally, the survey will be supplemented by similar observations covering 1296-1440 MHz over 38 per cent of the sky. POSSUM will enable the discovery and detailed investigation of magnetised phenomena in a wide range of cosmic environments, as well as the interplay between these components.
This paper reviews the current science case developed by the POSSUM Collaboration and provides an overview of POSSUM's observations, data processing, outputs, and its complementarity with other radio and multi-wavelength surveys, including future work with the SKA. [Abstract abridged]

Submitted 13 May, 2025; originally announced May 2025.

Comments: Accepted for publication in PASA. 32 pages, 9 figures, 1 table

arXiv:2505.02361 [pdf, other]

Learning simple heuristic rules for classifying materials based on chemical composition

Authors: Andrew Ma, Marin Soljačić

Abstract: In the past decade, there has been a significant interest in the use of machine learning approaches in materials science research. Conventional deep learning approaches that rely on complex, nonlinear models have become increasingly important in computational materials science due to their high predictive accuracy. In contrast to these approaches, we have shown in a recent work that a remarkably simple learned heuristic rule -- based on the concept of topogivity -- can classify whether a material is topological using only its chemical composition. In this paper, we go beyond the topology classification scenario by also studying the use of machine learning to develop simple heuristic rules for classifying whether a material is a metal based on chemical composition. Moreover, we present a framework for incorporating chemistry-informed inductive bias based on the structure of the periodic table. For both the topology classification and the metallicity classification tasks, we empirically characterize the performance of simple heuristic rules fit with and without chemistry-informed inductive bias across a wide range of training set sizes. We find evidence that incorporating chemistry-informed inductive bias can reduce the amount of training data required to reach a given level of test accuracy.

Submitted 5 May, 2025; originally announced May 2025.

Comments: 10 pages, 3 figures

arXiv:2505.02222 [pdf, other]

Practical Efficiency of Muon for Pretraining

Authors: Essential AI: Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani

Abstract: We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.

Submitted 19 May, 2025; v1 submitted 4 May, 2025; originally announced May 2025.

arXiv:2505.00258 [pdf, ps, other]

Quantile-RK and Double Quantile-RK Error Horizon Analysis

Authors: Emeric Battaglia, Anna Ma

Abstract: In solving linear systems of equations of the form $Ax=b$, corruptions present in $b$ affect stochastic iterative algorithms' ability to reach the true solution $x^\ast$ to the uncorrupted linear system. The randomized Kaczmarz method converges in expectation to $x^\ast$ up to an error horizon dependent on the conditioning of $A$ and the supremum norm of the corruption in $b$. To avoid this error horizon in the sparse corruption setting, previous works have proposed quantile-based adaptations that make iterative methods robust. Our work first establishes a new convergence rate for the quantile-based random Kaczmarz (qRK) and double quantile-based random Kaczmarz (dqRK) methods, which, under mild conditions, improves upon known bounds. We further consider the more practical setting in which the vector $b$ includes both non-sparse "noise" and sparse "corruption". Error horizon bounds for qRK and dqRK are derived and shown to produce a smaller error horizon compared to their non-quantile-based counterparts, further demonstrating the advantages of quantile-based methods.

Submitted 30 April, 2025; originally announced May 2025.

MSC Class: 65F10; 65F20
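
As a rough illustration of the quantile-based idea summarized in the abstract above, the following is a minimal sketch of one common formulation of quantile-based randomized Kaczmarz, not the authors' code. It assumes unit-norm rows, uniform row sampling, and a quantile taken over all residuals at every step; the helper names `make_system`, `quantile_rk`, and `error`, the corruption size of 100, and the default `q=0.7` are all hypothetical choices made for the example. The dqRK variant is not shown.

```python
import random


def make_system(m, n, num_corrupt, seed=0):
    """Build A x* = b with unit-norm rows, then add large sparse corruptions to b."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
    A = []
    for _ in range(m):
        row = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = sum(v * v for v in row) ** 0.5
        A.append([v / norm for v in row])
    b = [sum(a * x for a, x in zip(row, x_true)) for row in A]
    for i in rng.sample(range(m), num_corrupt):
        b[i] += 100.0  # assumed corruption magnitude, for illustration only
    return A, b, x_true


def quantile_rk(A, b, q=0.7, iters=3000, seed=1):
    """Randomized Kaczmarz that skips rows whose residual exceeds the q-quantile."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Residual of every equation at the current iterate.
        residuals = [sum(a * xi for a, xi in zip(row, x)) - bi
                     for row, bi in zip(A, b)]
        # Empirical q-quantile of the absolute residuals.
        threshold = sorted(abs(r) for r in residuals)[int(q * (m - 1))]
        i = rng.randrange(m)  # uniform row sampling
        if abs(residuals[i]) <= threshold:
            # Rows have unit norm, so the projection step is x - r_i * a_i.
            x = [xi - residuals[i] * a for xi, a in zip(x, A[i])]
    return x


def error(x, x_true):
    """Euclidean distance between the iterate and the uncorrupted solution."""
    return sum((u - v) ** 2 for u, v in zip(x, x_true)) ** 0.5
```

With `q=1.0` the threshold equals the largest residual, so every sampled row is accepted and the loop reduces to plain randomized Kaczmarz; comparing `error(...)` for the two settings on a system from `make_system` gives a feel for the error horizon the abstract refers to.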

arXiv:2504.13092 [pdf, other]

EventVAD: Training-Free Event-Aware Video Anomaly Detection

Authors: Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xinwei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, Shuyan Li

Abstract: Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require an amount of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning before determining final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.

Submitted 17 April, 2025; originally announced April 2025.

arXiv:2504.04022 [pdf, other]

Rethinking Reflection in Pre-Training

Authors: Essential AI: Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Anthony Polloreno, Ashish Tanwer, Burhan Drak Sibai, Divya S Mansingka, Divya Shivaprasad, Ishaan Shah, Karl Stratos, Khoi Nguyen, Michael Callahan, Michael Pust, Mrinal Iyer, Philip Monk, et al. (4 additional authors not shown)

Abstract: A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.

Submitted 4 April, 2025; originally announced April 2025.

arXiv:2503.22122 [pdf, other]

REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation

Authors: Puzhen Yuan, Angyuan Ma, Yunchao Yao, Huaxiu Yao, Masayoshi Tomizuka, Mingyu Ding

Abstract: Vision-language models (VLMs) have demonstrated remarkable capabilities in robotic planning, particularly for long-horizon tasks that require a holistic understanding of the environment for task decomposition. Existing methods typically rely on prior environmental knowledge or carefully designed task-specific prompts, making them struggle with dynamic scene changes or unexpected task conditions, e.g., a robot attempting to put a carrot in the microwave but finds the door was closed. Such challenges underscore two critical issues: adaptability and efficiency. To address them, in this work, we propose an adaptive multi-agent planning framework, termed REMAC, that enables efficient, scene-agnostic multi-robot long-horizon task planning and execution through continuous reflection and self-evolution. REMAC incorporates two key modules: a self-reflection module performing pre-condition and post-condition checks in the loop to evaluate progress and refine plans, and a self-evolvement module dynamically adapting plans based on scene-specific reasoning. It offers several appealing benefits: 1) Robots can initially explore and reason about the environment without complex prompt design. 2) Robots can keep reflecting on potential planning errors and adapting the plan based on task-specific insights. 3) After iterations, a robot can call another one to coordinate tasks in parallel, maximizing the task execution efficiency.
To validate REMAC's effectiveness, we build a multi-agent environment for long-horizon robot manipulation and navigation based on RoboCasa, featuring 4 task categories with 27 task styles and 50+ different objects. Based on it, we further benchmark state-of-the-art reasoning models, including DeepSeek-R1, o3-mini, QwQ, and Grok3, demonstrating REMAC's superiority by boosting average success rates by 40% and execution efficiency by 52.7% over the single robot baseline.

Submitted 27 March, 2025; originally announced March 2025.

arXiv:2503.21011 [pdf, other]

Can Large Language Models Predict Associations Among Human Attitudes?

Authors: Ana Ma, Derek Powell

Abstract: Prior work has shown that large language models (LLMs) can predict human attitudes based on other attitudes, but this work has largely focused on predictions from highly similar and interrelated attitudes. In contrast, human attitudes are often strongly associated even across disparate and dissimilar topics. Using a novel dataset of human responses toward diverse attitude statements, we found that a frontier language model (GPT-4o) was able to recreate the pairwise correlations among individual attitudes and to predict individuals' attitudes from one another. Crucially, in an advance over prior work, we tested GPT-4o's ability to predict in the absence of surface-similarity between attitudes, finding that while surface similarity improves prediction accuracy, the model was still highly-capable of generating meaningful social inferences between dissimilar attitudes. Altogether, our findings indicate that LLMs capture crucial aspects of the deeper, latent structure of human belief systems.

Submitted 26 March, 2025; originally announced March 2025.

arXiv:2503.18888 [pdf, other]

Toward building next-generation Geocoding systems: a systematic review

Authors: Zhengcong Yin, Daniel W. Goldberg, Binbin Lin, Bing Zhou, Diya Li, Andong Ma, Ziqian Ming, Heng Cai, Zhe Zhang, Shaohua Wang, Shanzhen Gao, Joey Ying Lee, Xiao Li, Da Huo

Abstract: Geocoding systems are widely used in both scientific research for spatial analysis and everyday life through location-based services. The quality of geocoded data significantly impacts subsequent processes and applications, underscoring the need for next-generation systems. In response to this demand, this review first examines the evolving requirements for geocoding inputs and outputs across various scenarios these systems must address. It then provides a detailed analysis of how to construct such systems by breaking them down into key functional components and reviewing a broad spectrum of existing approaches, from traditional rule-based methods to advanced techniques in information retrieval, natural language processing, and large language models. Finally, we identify opportunities to improve next-generation geocoding systems in light of recent technological advances.

Submitted 24 March, 2025; originally announced March 2025.

arXiv:2503.10701 [pdf, other]

Video Individual Counting for Moving Drones

Authors: Yaowu Fan, Jia Wan, Tao Han, Antoni B. Chan, Andy J. Ma

Abstract: Video Individual Counting (VIC) has received increasing attention recently due to its importance in intelligent video surveillance. Existing works are limited in two aspects, i.e., dataset and method. Previous crowd counting datasets are captured with fixed or rarely moving cameras with relatively sparse individuals, restricting evaluation for a highly varying view and time in crowded scenes. While VIC methods have been proposed based on localization-then-association or localization-then-classification, they may not perform well due to difficulty in accurate localization of crowded and small targets under challenging scenarios. To address these issues, we collect a MovingDroneCrowd Dataset and propose a density map based VIC method. Different from existing datasets, our dataset consists of videos captured by fast-moving drones in crowded scenes under diverse illuminations, shooting heights and angles. Rather than localizing individuals, we propose a Depth-wise Cross-Frame Attention (DCFA) module, which directly estimates inflow and outflow density maps through learning shared density maps between consecutive frames. The inflow density maps across frames are summed up to obtain the number of unique pedestrians in a video. Experiments on our datasets and publicly available ones show the superiority of our method over the state of the art for VIC in highly dynamic and complex crowded scenes. Our dataset and codes will be released publicly.

Submitted 12 March, 2025; originally announced March 2025.

arXiv:2503.10127 [pdf, other]

PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

Authors: Runze He, Bo Cheng, Yuhang Ma, Qingxiang Jia, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Liebucha Wu, Dawei Leng, Yuhui Yin

Abstract: In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and layout-to-image as two separate models, PlanGen jointly models the two tasks into one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multitasking training related to layout, including layout planning, layout-to-image generation, image layout understanding, etc. In addition, PlanGen can be seamlessly expanded to layout-guided image manipulation thanks to the well-designed modeling, with teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of our PlanGen in multiple layout-related tasks, showing its great potential. Code is available at: https://360cvgroup.github.io/PlanGen.

Submitted 30 March, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

Comments: 15 pages, 12 figures, project page: https://360cvgroup.github.io/PlanGen

arXiv:2503.09242 [pdf, other]

NAMI: Efficient Image Generation via Progressive Rectified Flow Transformers

Authors: Yuhang Ma, Bo Cheng, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Liebucha Wu, Dawei Leng, Yuhui Yin

Abstract: Flow-based transformer models for image generation have achieved state-of-the-art performance with larger model parameters, but their inference deployment cost remains high. To enhance inference performance while maintaining generation quality, we propose progressive rectified flow transformers. We divide the rectified flow into different stages according to resolution, using fewer transformer layers at the low-resolution stages to generate image layouts and concept contours, and progressively adding more layers as the resolution increases. Experiments demonstrate that our approach achieves fast convergence and reduces inference time while ensuring generation quality. The main contributions of this paper are summarized as follows: (1) We introduce progressive rectified flow transformers that enable multi-resolution training, accelerating model convergence; (2) NAMI leverages piecewise flow and spatial cascading of Diffusion Transformer (DiT) to rapidly generate images, reducing inference time by 40% to generate a 1024 resolution image; (3) We propose NAMI-1K benchmark to evaluate human preference performance, aiming to mitigate distributional bias and prevent data leakage from open-source benchmarks. The results show that our model is competitive with state-of-the-art models.

Submitted 12 March, 2025; originally announced March 2025.

arXiv:2503.08157 [pdf, other]

U-StyDiT: Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers

Authors: Zhanjie Zhang, Ao Ma, Ke Cao, Jing Wang, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin

Abstract: Ultra-high quality artistic style transfer refers to repainting an ultra-high quality content image using the style information learned from the style image. Existing artistic style transfer methods can be categorized into style reconstruction-based and content-style disentanglement-based style transfer approaches. Although these methods can generate some artistic stylized images, they still exhibit obvious artifacts and disharmonious patterns, which hinder their ability to produce ultra-high quality artistic stylized images. To address these issues, we propose a novel artistic image style transfer method, U-StyDiT, which is built on transformer-based diffusion (DiT) and learns content-style disentanglement, generating ultra-high quality artistic stylized images. Specifically, we first design a Multi-view Style Modulator (MSM) to learn style information from a style image from local and global perspectives, conditioning U-StyDiT to generate stylized images with the learned style information. Then, we introduce a StyDiT Block to learn content and style conditions simultaneously from a style image. Additionally, we propose an ultra-high quality artistic image dataset, Aes4M, comprising 10 categories, each containing 400,000 style images. This dataset effectively solves the problem that the existing style transfer methods cannot produce high-quality artistic stylized images due to the size of the dataset and the quality of the images in the dataset. Finally, the extensive qualitative and quantitative experiments validate that our U-StyDiT can create higher quality stylized images compared to state-of-the-art artistic style transfer methods. To our knowledge, our proposed method is the first to address the generation of ultra-high quality stylized images using transformer-based diffusion.

Submitted 11 March, 2025; originally announced March 2025.

arXiv:2503.08153 [pdf, other]

WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation

Authors: Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin, Xiaodan Liang

Abstract: Recent rapid advancements in text-to-video (T2V) generation, such as SoRA and Kling, have shown great potential for building world simulators. However, current T2V models struggle to grasp abstract physical principles and generate videos that adhere to physical laws. This challenge arises primarily from a lack of clear guidance on physical information due to a significant gap between abstract physical principles and generation models. To this end, we introduce the World Simulator Assistant (WISA), an effective framework for decomposing and incorporating physical principles into T2V models. Specifically, WISA decomposes physical principles into textual physical descriptions, qualitative physical categories, and quantitative physical properties. To effectively embed these physical attributes into the generation process, WISA incorporates several key designs, including Mixture-of-Physical-Experts Attention (MoPA) and a Physical Classifier, enhancing the model's physics awareness. Furthermore, most existing datasets feature videos where physical phenomena are either weakly represented or entangled with multiple co-occurring processes, limiting their suitability as dedicated resources for learning explicit physical principles. We propose a novel video dataset, WISA-32K, collected based on qualitative physical categories. It consists of 32,000 videos, representing 17 physical laws across three domains of physics: dynamics, thermodynamics, and optics. Experimental results demonstrate that WISA can effectively enhance the compatibility of T2V models with real-world physical laws, achieving a considerable improvement on the VideoPhy benchmark. The visual exhibitions of WISA and WISA-32K are available at https://360cvgroup.github.io/WISA/.

Submitted 11 March, 2025; originally announced March 2025.

arXiv:2503.02112 [pdf, other]

Building Machine Learning Challenges for Anomaly Detection in Science

Authors: Elizabeth G. Campolongo, Yuan-Tang Chou, Ekaterina Govorkova, Wahid Bhimji, Wei-Lun Chao, Chris Harris, Shih-Chieh Hsu, Hilmar Lapp, Mark S. Neubauer, Josephine Namayanja, Aneesh Subramanian, Philip Harris, Advaith Anand, David E. Carlyn, Subhankar Ghosh, Christopher Lawrence, Eric Moreno, Ryan Raikman, Jiaman Wu, Ziheng Zhang, Bayu Adhi, Mohammad Ahmadi Gharehtoragh, Saúl Alonso Monsalve, Marta Babicz, Furqan Baig , et al. (125 additional authors not shown)

Abstract: Scientific discoveries are often made by finding a pattern or object that was not predicted by the known rules of science. Oftentimes, these anomalous events or objects that do not conform to the norms are an indication that the rules of science governing the data are incomplete, and something new needs to be present to explain these unexpected outliers. The challenge of finding anomalies can be confounding since it requires codifying a complete knowledge of the known scientific behaviors and then projecting these known behaviors on the data to look for deviations. When utilizing machine learning, this presents a particular challenge since we require that the model not only understands scientific data perfectly but also recognizes when the data is inconsistent and out of the scope of its trained behavior. In this paper, we present three datasets aimed at developing machine learning-based anomaly detection for disparate scientific domains covering astrophysics, genomics, and polar science. We present the different datasets along with a scheme to make machine learning challenges around the three datasets findable, accessible, interoperable, and reusable (FAIR). Furthermore, we present an approach that generalizes to future machine learning challenges, enabling the possibility of large, more compute-intensive challenges that can ultimately lead to scientific discovery.

Submitted 29 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

Comments: 17 pages, 6 figures; to be submitted to Nature Communications

arXiv:2502.14377 [pdf, other]

RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers

Authors: Ke Cao, Jing Wang, Ao Ma, Jiasong Feng, Zhanjie Zhang, Xuanhua He, Shanyuan Liu, Bo Cheng, Dawei Leng, Yuhui Yin, Jie Zhang

Abstract: The Diffusion Transformer plays a pivotal role in advancing text-to-image and text-to-video generation, owing primarily to its inherent scalability. However, existing controlled diffusion transformer methods incur significant parameter and computational overheads and suffer from inefficient resource allocation due to their failure to account for the varying relevance of control information across different transformer layers. To address this, we propose the Relevance-Guided Efficient Controllable Generation framework, RelaCtrl, enabling efficient and resource-optimized integration of control signals into the Diffusion Transformer. First, we evaluate the relevance of each layer in the Diffusion Transformer to the control information by assessing the "ControlNet Relevance Score", i.e., the impact of skipping each control layer on both the quality of generation and the control effectiveness during inference. Based on the strength of the relevance, we then tailor the positioning, parameter scale, and modeling capacity of the control layers to reduce unnecessary parameters and redundant computations. Additionally, to further improve efficiency, we replace the self-attention and FFN in the commonly used copy block with the carefully designed Two-Dimensional Shuffle Mixer (TDSM), enabling efficient implementation of both the token mixer and channel mixer. Both qualitative and quantitative experimental results demonstrate that our approach achieves superior performance with only 15% of the parameters and computational complexity compared to PixArt-delta.

Submitted 23 March, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

Comments: Homepage: https://360cvgroup.github.io/RelaCtrl/ Github: https://github.com/360CVGroup/RelaCtrl

arXiv:2502.10381 [pdf, ps, other]

Balancing the Scales: A Theoretical and Algorithmic Framework for Learning from Imbalanced Data

Authors: Corinna Cortes, Anqi Mao, Mehryar Mohri, Yutao Zhong

Abstract: Class imbalance remains a major challenge in machine learning, especially in multi-class problems with long-tailed distributions. Existing methods, such as data resampling, cost-sensitive techniques, and logistic loss modifications, though popular and often effective, lack solid theoretical foundations. As an example, we demonstrate that cost-sensitive methods are not Bayes-consistent. This paper introduces a novel theoretical framework for analyzing generalization in imbalanced classification. We then propose a new class-imbalanced margin loss function for both binary and multi-class settings, prove its strong $H$-consistency, and derive corresponding learning guarantees based on empirical loss and a new notion of class-sensitive Rademacher complexity. Leveraging these theoretical results, we devise novel and general learning algorithms, IMMAX (Imbalanced Margin Maximization), which incorporate confidence margins and are applicable to various hypothesis sets. While our focus is theoretical, we also present extensive empirical results demonstrating the effectiveness of our algorithms compared to existing baselines.

Submitted 25 June, 2025; v1 submitted 14 February, 2025; originally announced February 2025.

Comments: ICML 2025

arXiv:2502.01925 [pdf, ps, other]

PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling

Authors: Avery Ma, Yangchen Pan, Amir-massoud Farahmand

Abstract: Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges between the user and the model. These exchanges are randomly sampled from a pool of unsafe question-answer pairs, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt's topic. We also introduce ManyHarm, a dataset of harmful question-answer pairs, and demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.

Submitted 12 June, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

Comments: Accepted at ICML 2025 (Spotlight). Code: https://github.com/averyma/pandas

arXiv:2501.12945 [pdf, other]

doi 10.1088/2058-9565/add04d

Unraveling quantum phase estimation: exploring the impact of multi-photon interference on the quantum Fisher information

Authors: Annameng Ma, Agustina G. Magnoni, Miguel A. Larotonda, Laura T. Knoll

Abstract: Quantum interference is known to become extinct with distinguishing information, as illustrated by the ubiquitous double-slit experiment or the two-photon HOM effect. In the former case single particle interference is destroyed with which-path information while in the latter bunching interference tails off as photons become distinguishable. It has been observed that when more than two particles are involved, these interference patterns are in general a non-monotonic function of the distinguishability. Here we perform a comprehensive characterization, both theoretically and experimentally, of four-photon interference by analyzing the corresponding correlation functions, contemplating several degrees of distinguishability across different parameters. This study provides all the necessary tools to quantify the impact of multi-photon interference on precision measurements of parameters such as phase, frequency, and time difference. We apply these insights to quantify the precision in the estimation of an interferometric phase in a two-port interferometer using a four-photon state. Our results reveal that, for certain phase values, partially distinguishable multi-photon states can achieve higher Fisher information values compared to the two-photon experiment. These findings highlight the potential of distinguishable multi-photon states for enhanced precision in quantum metrology and related applications.

Submitted 25 April, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

Comments: 13 pages, 6 figures

arXiv:2501.12427 [pdf, other]

SafePowerGraph-HIL: Real-Time HIL Validation of Heterogeneous GNNs for Bridging Sim-to-Real Gap in Power Grids

Authors: Aoxiang Ma, Salah Ghamizi, Jun Cao, Pedro Rodriguez

Abstract: As machine learning (ML) techniques gain prominence in power system research, validating these methods' effectiveness under real-world conditions requires real-time hardware-in-the-loop (HIL) simulations. HIL simulation platforms enable the integration of computational models with physical devices, allowing rigorous testing across diverse scenarios critical to system resilience and reliability. In this study, we develop a SafePowerGraph-HIL framework that utilizes HIL simulations on the IEEE 9-bus system, modeled in Hypersim, to generate high-fidelity data, which is then transmitted in real-time via SCADA to an AWS cloud database before being input into a Heterogeneous Graph Neural Network (HGNN) model designed for power system state estimation and dynamic analysis. By leveraging Hypersim's capabilities, we simulate complex grid interactions, providing a robust dataset that captures critical parameters for HGNN training. The trained HGNN is subsequently validated using newly generated data under varied system conditions, demonstrating accuracy and robustness in predicting power system states. The results underscore the potential of integrating HIL with advanced neural network architectures to enhance the real-time operational capabilities of power systems. This approach represents a significant advancement toward the development of intelligent, adaptive control strategies that support the robustness and resilience of evolving power grids.

Submitted 21 January, 2025; originally announced January 2025.

Comments: 5 pages, 5 figures

arXiv:2501.11570 [pdf, other]

Uncertainty Estimation in the Real World: A Study on Music Emotion Recognition

Authors: Karn N. Watcharasupat, Yiwei Ding, T. Aleksandra Ma, Pavan Seshadri, Alexander Lerch

Abstract: Any data annotation for subjective tasks shows potential variations between individuals. This is particularly true for annotations of emotional responses to musical stimuli. While older approaches to music emotion recognition systems frequently addressed this uncertainty problem through probabilistic modeling, modern systems based on neural networks tend to ignore the variability and focus only on predicting central tendencies of human subjective responses. In this work, we explore several methods for estimating not only the central tendencies of the subjective responses to a musical stimulus, but also for estimating the uncertainty associated with these responses. In particular, we investigate probabilistic loss functions and inference-time random sampling. Experimental results indicate that while the modeling of the central tendencies is achievable, modeling of the uncertainty in subjective responses proves significantly more challenging with currently available approaches even when empirical estimates of variations in the responses are available.

Submitted 20 January, 2025; originally announced January 2025.

Comments: To be presented as a Findings paper at the 2025 European Conference on Information Retrieval (ECIR)

arXiv:2501.07272 [pdf, other]

A Large-Scale Reconfigurable Multiplexed Quantum Photonic Network

Authors: Natalia Herrera Valencia, Annameng Ma, Suraj Goel, Saroch Leedumrongwatthanakun, Francesco Graffitti, Alessandro Fedrizzi, Will McCutcheon, Mehul Malik

Abstract: Entanglement distribution in quantum networks will enable next-generation technologies for quantum-secured communications, distributed quantum computing and sensing. Future quantum networks will require dense connectivity, allowing multiple users to share entanglement in a reconfigurable and multiplexed manner, while long-distance connections are established through the teleportation of entanglement, or entanglement swapping. While several recent works have demonstrated fully connected, local multi-user networks based on multiplexing, extending this to a global network architecture of interconnected local networks remains an outstanding challenge. Here we demonstrate the next stage in the evolution of multiplexed quantum networks: a prototype global reconfigurable network where entanglement is routed and teleported in a flexible and multiplexed manner between two local multi-user networks composed of four users each. At the heart of our network is a programmable 8x8-dimensional multi-port circuit that harnesses the natural mode-mixing process inside a multi-mode fibre to implement on-demand high-dimensional operations on two independent photons carrying eight transverse-spatial modes. Our circuit design allows us to break away from the limited planar geometry and bypass the control and fabrication challenges of conventional integrated photonic platforms. Our demonstration showcases the potential of this architecture for enabling large-scale, global quantum networks that offer versatile connectivity while being fully compatible with an existing communications infrastructure.

Submitted 27 January, 2025; v1 submitted 13 January, 2025; originally announced January 2025.

arXiv:2501.02932 [pdf, other]

Predicting band gap from chemical composition: A simple learned model for a material property with atypical statistics

Authors: Andrew Ma, Owen Dugan, Marin Soljačić

Abstract: In solid-state materials science, substantial efforts have been devoted to the calculation and modeling of the electronic band gap. While a wide range of ab initio methods and machine learning algorithms have been created that can predict this quantity, the development of new computational approaches for studying the band gap remains an active area of research. Here we introduce a simple machine learning model for predicting the band gap using only the chemical composition of the crystalline material. To motivate the form of the model, we first analyze the empirical distribution of the band gap, which sheds new light on its atypical statistics. Specifically, our analysis enables us to frame band gap prediction as a task of modeling a mixed random variable, and we design our model accordingly. Our model formulation incorporates thematic ideas from chemical heuristic models for other material properties in a manner that is suited towards the band gap modeling task. The model has exactly one parameter corresponding to each element, which is fit using data. To predict the band gap for a given material, the model computes a weighted average of the parameters associated with its constituent elements and then takes the maximum of this quantity and zero. The model provides heuristic chemical interpretability by intuitively capturing the associations between the band gap and individual chemical elements.

Submitted 6 January, 2025; originally announced January 2025.

Comments: 9 pages, 4 figures
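
The prediction rule described in the abstract above (a weighted average of per-element parameters, clipped at zero) can be illustrated with a minimal sketch. This is not the authors' code: the function name `predict_band_gap`, the use of composition amounts as the averaging weights, and the parameter values are all assumptions made for illustration, since the abstract does not specify how the weights are defined or what the fitted parameters are.

```python
from typing import Dict


def predict_band_gap(composition: Dict[str, float],
                     element_params: Dict[str, float]) -> float:
    """Hypothetical sketch: weighted average of per-element parameters, clipped at zero.

    composition    maps element symbol -> amount in the formula (assumed weights).
    element_params maps element symbol -> one fitted parameter per element
                   (values used below are made up, not fitted).
    """
    total = sum(composition.values())
    if total <= 0:
        raise ValueError("composition must contain positive amounts")
    # Weighted average of the parameters of the constituent elements.
    weighted_avg = sum(
        amount * element_params[element]
        for element, amount in composition.items()
    ) / total
    # Taking the maximum with zero keeps the prediction non-negative.
    return max(weighted_avg, 0.0)


# Illustrative, made-up parameters for two placeholder elements "A" and "B".
params = {"A": 2.0, "B": -1.0}
gap_ab = predict_band_gap({"A": 1, "B": 1}, params)   # average is 0.5
gap_ab3 = predict_band_gap({"A": 1, "B": 3}, params)  # average is negative, clipped
```

Clipping at zero lets a single expression return exactly zero for some compositions and a positive value for others, which fits the abstract's framing of the band gap as a mixed random variable.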

arXiv:2501.00380 [pdf, other]

doi 10.1051/0004-6361/202451734

An efficient unsupervised classification model for galaxy morphology: Voting clustering based on coding from ConvNeXt large model

Authors: Guanwen Fang, Yao Dai, Zesen Lin, Chichun Zhou, Jie Song, Yizhou Gu, Xiaotong Guo, Anqi Mao, Xu Kong

Abstract: In this work, we update the unsupervised machine learning (UML) step by proposing an algorithm based on ConvNeXt large model coding to improve the efficiency of unlabeled galaxy morphology classifications. The method can be summarized into three key aspects as follows: (1) a convolutional autoencoder is used for image denoising and reconstruction and the rotational invariance of the model is improved by polar coordinate extension; (2) utilizing a pre-trained convolutional neural network (CNN) named ConvNeXt for encoding the image data. The features were further compressed via a principal component analysis (PCA) dimensionality reduction; (3) adopting a bagging-based multi-model voting classification algorithm to enhance robustness. We applied this model to I-band images of a galaxy sample with $I_{\rm mag}< 25$ in the COSMOS field. Compared to the original unsupervised method, the number of clustering groups required by the new method is reduced from 100 to 20. Finally, we managed to classify about 53\% of the galaxies, significantly improving the classification efficiency. To verify the validity of the morphological classification, we selected massive galaxies with $M_{*}>10^{10}\,M_{\odot}$ for morphological parameter tests. The corresponding rules between the classification results and the physical properties of galaxies on multiple parameter surfaces are consistent with the existing evolution model. Our method has demonstrated the feasibility of using large model encoding to classify galaxy morphology, which not only improves the efficiency of galaxy morphology classification, but also saves time and manpower. Furthermore, in comparison to the original UML model, the enhanced classification performance is more evident in qualitative analysis and has successfully surpassed a greater number of parameter tests.

Submitted 31 December, 2024; originally announced January 2025.

Comments: Accepted by A&A; 12 pages, 12 figures

arXiv:2412.19552 [pdf, ps, other]

Contrast-Optimized Basis Functions for Self-Navigated Motion Correction in Quantitative MRI

Authors: Elisa Marchetto, Sebastian Flassbeck, Andrew Mao, Jakob Assländer

Abstract: Purpose: The long scan times of quantitative MRI techniques make motion artifacts more likely. For MR-Fingerprinting-like approaches, this problem can be addressed with self-navigated retrospective motion correction based on reconstructions in a singular value decomposition (SVD) subspace. However, the SVD promotes high signal intensity in all tissues, which limits the contrast between tissue types and ultimately reduces the accuracy of registration. The purpose of this paper is to rotate the subspace for maximum contrast between two types of tissue and improve the accuracy of motion estimates. Methods: A subspace is derived that promotes contrasts between brain parenchyma and CSF, achieved through the generalized eigendecomposition of mean autocorrelation matrices, followed by a Gram-Schmidt process to maintain orthogonality. We tested our motion correction method on 85 scans with varying motion levels, acquired with a 3D hybrid-state sequence optimized for quantitative magnetization transfer imaging. Results: A comparative analysis shows that the contrast-optimized basis significantly improves the parenchyma-CSF contrast, leading to smoother motion estimates and reduced artifacts in the quantitative maps. Conclusion: The proposed contrast-optimized subspace improves the accuracy of the motion estimation.

Submitted 17 June, 2025; v1 submitted 27 December, 2024; originally announced December 2024.

arXiv:2412.16434 [pdf, other]

SYMPHONY: Improving Memory Management for LLM Inference Workloads

Authors: Saurabh Agarwal, Anyong Mao, Aditya Akella, Shivaram Venkataraman

Abstract: Large Language Models (LLMs) are increasingly being deployed in applications such as chatbots, code editors, and conversational agents. A key feature of LLMs is their ability to engage in multi-turn interactions with humans or external tools, enabling a wide range of tasks. Each new request in a multi-turn interaction depends on the intermediate state, specifically the key-value (K,V) caches, from previous requests in the ongoing interaction. Existing serving engines either recompute the K,V caches or offload them to main memory. Profiling reveals that recomputation can result in over 99% of processed tokens being redundant. On the other hand, offloading K,V caches from GPU memory makes inference serving stateful, leading to load imbalances across the cluster. To address these challenges, we developed SYMPHONY. SYMPHONY leverages the observation that multi-turn workloads provide additional hints that allow K,V caches to be migrated off the critical serving path. By utilizing these hints, SYMPHONY dynamically migrates K,V caches to enable fine-grained scheduling of inference requests. Our experiments demonstrate that SYMPHONY can handle over 8x the number of requests compared to state-of-the-art baselines, with a similar latency profile.

Submitted 20 December, 2024; originally announced December 2024.

arXiv:2412.09314 [pdf, other]

A first glimpse at the MeerKAT DEEP2 field at S-band

Authors: S. Ranchod, J. D. Wagenveld, H. -R. Klöckner, O. Wucknitz, R. P. Deane, S. S. Sridhar, E. Barr, S. Buchner, F. Camilo, A. Damas-Segovia, C. Kasemann, M. Kramer, L. S. Legodi, S. A. Mao, K. Menten, I. Rammala, M. R. Rugel, G. Wieching

Abstract: We present the first widefield extragalactic continuum catalogue with the MeerKAT S-band (2.5 GHz), of the radio-selected DEEP2 field. The combined image over the S1 (1.96 - 2.84 GHz) and S4 (2.62 - 3.50 GHz) sub-bands has an angular resolution of 6.8''$\times$3.6'' (4.0''$\times$2.4'') at a robust weighting of $R = 0.3$ ($R=-0.5$) and a sensitivity of 4.7 (7.5) $μ$Jy beam$^{-1}$ with an on-source integration time of 70 minutes and a minimum of 52 of the 64 antennas, for respective observations. We present the differential source counts for this field, as well as a morphological comparison of resolved sources between S-band and archival MeerKAT L-band images. We find consistent source counts with the literature and provide spectral indices fitted over a combined frequency range of 1.8 GHz. These observations provide an important first demonstration of the capabilities of MeerKAT S-band imaging with relatively short integration times, as well as a comparison with existing S-band surveys, highlighting the rich scientific potential with future MeerKAT S-band surveys.

Submitted 12 December, 2024; originally announced December 2024.

Comments: 16 pages, 12 figures, 7 tables, Accepted for publication in MNRAS

arXiv:2412.04400 [pdf]

Enhanced Sampling of Protein Conformational Changes via True Reaction Coordinates from Energy Relaxation

Authors: Huiyu Li, Ao Ma

Abstract: The bottleneck in enhanced sampling lies in finding collective variables (CVs) that can effectively accelerate protein conformational changes. True reaction coordinates (tRCs) that can predict the committor are considered the optimal CVs, but identifying them requires unbiased natural reactive trajectories, which, paradoxically, depend on effective enhanced sampling. Using the generalized work functional method, we found that tRCs control both conformational changes and energy relaxation, enabling us to compute tRCs from energy relaxation simulations. Applying bias to tRCs accelerated conformational changes and ligand dissociation in HIV-1 protease and the PDZ2 domain by 10^5 to 10^15-fold. The resulting trajectories follow natural transition pathways, enabling efficient generation of natural reactive trajectories. In contrast, biased trajectories from empirical CVs often display non-physical features. Furthermore, by computing tRCs from a single protein structure, our method enables predictive sampling of conformational changes. These findings significantly broaden the range of protein functional processes accessible to molecular dynamics simulations.

Submitted 5 December, 2024; originally announced December 2024.

arXiv:2410.16644 [pdf]

CKSP: Cross-species Knowledge Sharing and Preserving for Universal Animal Activity Recognition

Authors: Axiu Mao, Meilu Zhu, Zhaojin Guo, Zheng He, Tomas Norton, Kai Liu

Abstract: Deep learning techniques are dominating automated animal activity recognition (AAR) tasks with wearable sensors due to their high performance on large-scale labelled data. However, current deep learning-based AAR models are trained solely on datasets of individual animal species, constraining their applicability in practice and performing poorly when training data are limited. In this study, we propose a one-for-many framework, dubbed Cross-species Knowledge Sharing and Preserving (CKSP), based on sensor data of diverse animal species. Given the coexistence of generic and species-specific behavioural patterns among different species, we design a Shared-Preserved Convolution (SPConv) module. This module assigns an individual low-rank convolutional layer to each species for extracting species-specific features and employs a shared full-rank convolutional layer to learn generic features, enabling the CKSP framework to learn inter-species complementarity and alleviating data limitations via increasing data diversity. Considering the training conflict arising from discrepancies in data distributions among species, we devise a Species-specific Batch Normalization (SBN) module, that involves multiple BN layers to separately fit the distributions of different species. To validate CKSP's effectiveness, experiments are performed on three public datasets from horses, sheep, and cattle, respectively.
The results show that our approach remarkably boosts the classification performance compared to the baseline method (one-for-one framework) solely trained on individual-species data, with increments of 6.04%, 2.06%, and 3.66% in accuracy, and 10.33%, 3.67%, and 7.90% in F1-score for the horse, sheep, and cattle datasets, respectively. This proves the promising capabilities of our method in leveraging multi-species data to augment classification performance.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.14324 [pdf, other]

HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation

Authors: Bo Cheng, Yuhang Ma, Liebucha Wu, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Dawei Leng, Yuhui Yin

Abstract: The task of layout-to-image generation involves synthesizing images based on the captions of objects and their spatial positions. Existing methods still struggle in complex layout generation, where common bad cases include object missing, inconsistent lighting, conflicting view angles, etc. To effectively address these issues, we propose a Hierarchical Controllable (HiCo) diffusion model for layout-to-image generation, featuring an object-separable conditioning branch structure. Our key insight is to achieve spatial disentanglement through hierarchical modeling of layouts. We use a multi-branch structure to represent the hierarchy and aggregate the branches in a fusion module. To evaluate the performance of multi-objective controllable layout generation in natural scenes, we introduce the HiCo-7K benchmark, derived from the GRIT-20M dataset and manually cleaned. https://github.com/360CVGroup/HiCo_T2I.

Submitted 18 October, 2024; originally announced October 2024.

Comments: NeurIPS2024

arXiv:2410.13395 [pdf, ps, other]

Reverse Quantile-RK and its Application to Quantile-RK

Authors: Emeric Battaglia, Anna Ma

Abstract: When solving linear systems $Ax=b$, $A$ and $b$ are given, but the measurements $b$ often contain corruptions. Inspired by recent work on the quantile-randomized Kaczmarz method, we propose an acceleration of the randomized Kaczmarz method using quantile information. We show that the proposed acceleration converges faster than the randomized Kaczmarz algorithm. In addition, we show that our proposed approach can be used in conjunction with the quantile-randomized Kaczmarz algorithm, without adding additional computational complexity, to produce both a fast and robust iterative method for solving large, sparsely corrupted linear systems. Our extensive experimental results support the use of the revised algorithm.

Submitted 17 October, 2024; originally announced October 2024.

arXiv:2410.12926 [pdf, other]

DEeR: Deviation Eliminating and Noise Regulating for Privacy-preserving Federated Low-rank Adaptation

Authors: Meilu Zhu, Axiu Mao, Jun Liu, Yixuan Yuan

Abstract: Integrating low-rank adaptation (LoRA) with federated learning (FL) has received widespread attention recently, aiming to adapt pretrained foundation models (FMs) to downstream medical tasks via privacy-preserving decentralized training. However, owing to the direct combination of LoRA and FL, current methods generally suffer from two problems, i.e., aggregation deviation and the differential privacy (DP) noise amplification effect. To address these problems, we propose a novel privacy-preserving federated finetuning framework called Deviation Eliminating and Noise Regulating (DEeR). Specifically, we first theoretically prove that the necessary condition to eliminate aggregation deviation is guaranteeing the equivalence between LoRA parameters of clients. Based on this theoretical insight, a deviation eliminator is designed that uses an alternating minimization algorithm to iteratively optimize the zero-initialized and non-zero-initialized parameter matrices of LoRA, ensuring that the aggregation deviation is always zero during training. Furthermore, we also conduct an in-depth analysis of the noise amplification effect and find that this problem is mainly caused by the "linear relationship" between DP noise and LoRA parameters. To suppress the noise amplification effect, we propose a noise regulator that exploits two regulator factors to decouple the relationship between DP and LoRA, thereby achieving robust privacy protection and excellent finetuning performance.
Additionally, we perform comprehensive ablation experiments to verify the effectiveness of the deviation eliminator and noise regulator. DEeR shows better performance on public medical datasets in comparison with state-of-the-art approaches. The code is available at https://github.com/CUHK-AIM-Group/DEeR.

Submitted 16 October, 2024; originally announced October 2024.

arXiv:2410.11747 [pdf, other]

Cloud properties in simulated galactic winds

Authors: Orlando Warren, Evan E. Schneider, S. Alwin Mao, Matthew W. Abruzzo

Abstract: In this work, we investigate the properties of a population of cool clouds in simulated galaxy outflows. Using data from the CGOLS isolated galaxy simulations, we generate catalogues of $\sim 10^5$ clouds. We describe the impact of two different supernova feedback models -- a centrally concentrated starburst and disk-wide distributed star formation -- on the resulting cloud population. In both cases we find that the mass distribution function $dN/dM \propto M^{-2}$, in good agreement with model predictions of turbulent fragmentation. We explore how cloud properties change with distance from the galaxy and find no qualitative distinction between the two feedback modes, although significant quantitative differences exist in attributes such as the total number of clouds, their densities, etc. We further show that both internal cloud velocities and cloud-cloud relative velocities are described well by properties of turbulent motion, despite significant bulk radial velocities. Finally, we investigate the distribution of cloud sizes in the context of recent theoretical arguments about cloud survival in winds. We find that proposed cloud survival criteria are a good predictor of cloud survival, in both the case where clouds are primarily destroyed and the case where cloud growth occurs in the outflow.

Submitted 15 October, 2024; originally announced October 2024.

Comments: 24 pages, 18 figures, submitted to The Astrophysical Journal

arXiv:2410.02081 [pdf, other]

MixLinear: Extreme Low Resource Multivariate Time Series Forecasting with 0.1K Parameters

Authors: Aitian Ma, Dongsheng Luo, Mo Sha

Abstract: Recently, there has been a growing interest in Long-term Time Series Forecasting (LTSF), which involves predicting long-term future values by analyzing a large amount of historical time-series data to identify patterns and trends. There exist significant challenges in LTSF due to its complex temporal dependencies and high computational demands. Although Transformer-based models offer high forecasting accuracy, they are often too compute-intensive to be deployed on devices with hardware constraints. On the other hand, the linear models aim to reduce the computational overhead by employing either decomposition methods in the time domain or compact representations in the frequency domain. In this paper, we propose MixLinear, an ultra-lightweight multivariate time series forecasting model specifically designed for resource-constrained devices. MixLinear effectively captures both temporal and frequency domain features by modeling intra-segment and inter-segment variations in the time domain and extracting frequency variations from a low-dimensional latent space in the frequency domain. By reducing the parameter scale of a downsampled $n$-length input/output one-layer linear model from $O(n^2)$ to $O(n)$, MixLinear achieves efficient computation without sacrificing accuracy.
Extensive evaluations with four benchmark datasets show that MixLinear attains forecasting performance comparable to, or surpassing, state-of-the-art models with significantly fewer parameters ($0.1K$), which makes it well-suited for deployment on devices with limited computational capacity.

Submitted 2 October, 2024; originally announced October 2024.

arXiv:2410.02070 [pdf, other]

MMFNet: Multi-Scale Frequency Masking Neural Network for Multivariate Time Series Forecasting

Authors: Aitian Ma, Dongsheng Luo, Mo Sha

Abstract: Long-term Time Series Forecasting (LTSF) is critical for numerous real-world applications, such as electricity consumption planning, financial forecasting, and disease propagation analysis. LTSF requires capturing long-range dependencies between inputs and outputs, which poses significant challenges due to complex temporal dynamics and high computational demands. While linear models reduce model complexity by employing frequency domain decomposition, current approaches often assume stationarity and filter out high-frequency components that may contain crucial short-term fluctuations. In this paper, we introduce MMFNet, a novel model designed to enhance long-term multivariate forecasting by leveraging a multi-scale masked frequency decomposition approach. MMFNet captures fine, intermediate, and coarse-grained temporal patterns by converting time series into frequency segments at varying scales while employing a learnable mask to filter out irrelevant components adaptively. Extensive experimentation with benchmark datasets shows that MMFNet not only addresses the limitations of the existing methods but also consistently achieves good performance. Specifically, MMFNet achieves up to 6.0% reductions in the Mean Squared Error (MSE) compared to state-of-the-art models designed for multivariate forecasting tasks.

Submitted 2 October, 2024; originally announced October 2024.

arXiv:2409.17123 [pdf, other]

On the Bivariate Characteristic Polynomial of the Shuffle Lattice

Authors: Annabel Ma

Abstract: The shuffle lattice was introduced by Greene in 1988 as an idealized model for DNA mutation, when he revealed remarkable combinatorial properties of this structure. In this paper, we prove an explicit formula for the $M$-triangle of the shuffle lattice, a bivariate refinement of the characteristic polynomial, as conjectured by McConville and Mühle in 2022, and find a relation between the $M$-triangle and the $H$-triangle, a bivariate refinement of the rank generating function.

Submitted 25 September, 2024; originally announced September 2024.

Comments: 21 pages, 3 figures

arXiv:2409.07730 [pdf, other]

Music auto-tagging in the long tail: A few-shot approach

Authors: T. Aleksandra Ma, Alexander Lerch

Abstract: In the realm of digital music, using tags to efficiently organize and retrieve music from extensive databases is crucial for music catalog owners. Human tagging by experts is labor-intensive but mostly accurate, whereas automatic tagging through supervised learning has approached satisfying accuracy but is restricted to a predefined set of training tags. Few-shot learning offers a viable solution to expand beyond this small set of predefined tags by enabling models to learn from only a few human-provided examples to understand tag meanings and subsequently apply these tags autonomously. We propose to integrate few-shot learning methodology into multi-label music auto-tagging by using features from pre-trained models as inputs to a lightweight linear classifier, also known as a linear probe. We investigate different popular pre-trained features, as well as different few-shot parametrizations with varying numbers of classes and samples per class. Our experiments demonstrate that a simple model with pre-trained features can achieve performance close to state-of-the-art models while using significantly less training data, such as 20 samples per tag. Additionally, our linear probe performs competitively with leading models when trained on the entire training dataset. The results show that this transfer learning-based few-shot approach could effectively address the issue of automatically assigning long-tail tags with only limited labeled data.

Submitted 16 September, 2024; v1 submitted 11 September, 2024; originally announced September 2024.

Comments: Published in Audio Engineering Society NY Show 2024 as a Peer Reviewed (Category 1) paper; typos corrected

ACM Class: H.3.3

arXiv:2409.04005 [pdf, other]

Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task

Authors: Jing Wang, Ao Ma, Jiasong Feng, Dawei Leng, Yuhui Yin, Xiaodan Liang

Abstract: The global self-attention mechanism in diffusion transformers involves redundant computation due to the sparse and redundant nature of visual information, and the attention map of tokens within a spatial window shows significant similarity. To address this redundancy, we propose the Proxy-Tokenized Diffusion Transformer (PT-DiT), which employs sparse representative token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, within each transformer block, we compute an averaging token from each spatial-temporal window to serve as a proxy token for that region. The global semantics are captured through the self-attention of these proxy tokens and then injected into all latent tokens via cross-attention. Simultaneously, we introduce window and shift window attention to address the limitations in detail modeling caused by the sparse attention mechanism. Building on the well-designed PT-DiT, we further develop the Qihoo-T2X family, which includes a variety of models for T2I, T2V, and T2MV tasks. Experimental results show that PT-DiT achieves competitive performance while reducing the computational complexity in both image and video generation tasks (e.g., a 49% reduction compared to DiT and a 34% reduction compared to PixArt-$α$). The visual exhibition and source code of Qihoo-T2X are available at https://360cvgroup.github.io/Qihoo-T2X/.

Submitted 4 October, 2024; v1 submitted 5 September, 2024; originally announced September 2024.

arXiv:2408.17086 [pdf]

Reaction Coordinates are Optimal Channels of Energy Flow

Authors: Ao Ma, Huiyu Li

Abstract: Reaction coordinates (RCs) are the few essential coordinates of a protein that control its functional processes, such as allostery, enzymatic reaction, and conformational change. They are critical for understanding protein function and provide optimal enhanced sampling of protein conformational changes and states. Since the pioneering works in the late 1990s, identifying the correct and objectively provable RCs has been a central topic in molecular biophysics and chemical physics. This review summarizes the major advances in identifying RCs over the past 25 years, focusing on methods aimed at finding RCs that meet the rigorous committor criterion, widely accepted as the true RCs. Importantly, the newly developed physics-based energy flow theory and generalized work functional method provide a general and rigorous approach for identifying true RCs, revealing their physical nature as the optimal channels of energy flow in biomolecules.

Submitted 30 August, 2024; originally announced August 2024.

arXiv:2408.13547 [pdf, other]

Frontal Slice Approaches for Tensor Linear Systems

Authors: Hengrui Luo, Anna Ma

Abstract: Inspired by the row and column action methods for solving large-scale linear systems, in this work, we explore the use of frontal slices for solving tensor linear systems. In particular, this paper presents a novel approach for using frontal slices of a tensor $\mathcal{A}$ to solve tensor linear systems $\mathcal{A} * \mathcal{X} = \mathcal{B}$ where $*$ denotes the t-product. In addition, we consider variations of this method, including cyclic, block, and randomized approaches, each designed to optimize performance in different operational contexts. Our primary contribution lies in the development and convergence analysis of these methods. Experimental results on synthetically generated and real-world data, including applications such as image and video deblurring, demonstrate the efficacy of our proposed approaches and validate our theoretical findings.

Submitted 24 August, 2024; originally announced August 2024.

Comments: 41 pages, 10 figures

MSC Class: 15A69; 15A72; 65F10

arXiv:2408.08189 [pdf, other]

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Authors: Jiasong Feng, Ao Ma, Jing Wang, Bo Cheng, Xiaodan Liang, Dawei Leng, Yuhui Yin

Abstract: Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially when dealing with extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, equivalently guiding different frame generations without frame-specific textual guidance. Thus, the model's capacity to comprehend the temporal logic conveyed in prompts and generate videos with coherent motion is restricted. To tackle this limitation, we introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism with the well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner (TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively, to achieve frame-specific textual guidance. Firstly, TII injects frame-specific information from latent features into text conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines the correlation matrix between cross-frame textual conditions and latent features along the time dimension. Lastly, TFB boosts the temporal consistency of latent features. Extensive experiments comprising both quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo. Our video demo, code and model are available at https://360cvgroup.github.io/FancyVideo/.

Submitted 16 August, 2024; v1 submitted 15 August, 2024; originally announced August 2024.

arXiv:2408.08105 [pdf, other]

Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Discern Causal Links Across Modalities

Authors: Zhiyuan Li, Heng Wang, Dongnan Liu, Chaoyi Zhang, Ao Ma, Jieting Long, Weidong Cai

Abstract: Multimodal Large Language Models (MLLMs) have showcased exceptional Chain-of-Thought (CoT) reasoning ability in complex textual inference tasks including causal reasoning. However, will these causalities remain straightforward when crucial hints hide in visual details? If not, what factors might influence cross-modal generalization? Can we effectively enhance their capacity for robust causal inference across both text and vision? Motivated by these questions, we introduce MuCR - a novel Multimodal Causal Reasoning benchmark that leverages synthetic siamese images and text pairs to challenge MLLMs. Additionally, we develop tailored metrics from multiple perspectives, including image-level match, phrase-level understanding, and sentence-level explanation, to comprehensively assess MLLMs' comprehension abilities. Our experiments reveal that current MLLMs fall short in multimodal causal reasoning compared to their performance in purely textual settings. Additionally, we find that identifying visual cues across images is key to effective cross-modal generalization. Finally, we propose a VcCoT strategy that better highlights visual cues, and our results confirm its efficacy in enhancing multimodal causal reasoning. The project is available at: https://github.com/Zhiyuan-Li-John/MuCR

Submitted 25 May, 2025; v1 submitted 15 August, 2024; originally announced August 2024.

Comments: ACL2025 Findings

arXiv:2407.18496 [pdf, other]

Towards More Accurate Prediction of Human Empathy and Emotion in Text and Multi-turn Conversations by Combining Advanced NLP, Transformers-based Networks, and Linguistic Methodologies

Authors: Manisha Singh, Divy Sharma, Alonso Ma, Nora Goldfine

Abstract: Based on the WASSA 2022 Shared Task on Empathy Detection and Emotion Classification, we predict the level of empathic concern and personal distress displayed in essays. For the first stage of this project we implemented a Feed-Forward Neural Network using sentence-level embeddings as features. We experimented with four different embedding models for generating the inputs to the neural network. The subsequent stage builds upon the previous work and we have implemented three types of revisions. The first revision focuses on the enhancements to the model architecture and the training approach. The second revision focuses on handling class imbalance using stratified data sampling. The third revision focuses on leveraging lexical resources, where we apply four different resources to enrich the features associated with the dataset. During the final stage of this project, we have created the final end-to-end system for the primary task using an ensemble of models to revise primary task performance. Additionally, as part of the final stage, these approaches have been adapted to the WASSA 2023 Shared Task on Empathy Emotion and Personality Detection in Interactions, in which the empathic concern, emotion polarity, and emotion intensity in dyadic text conversations are predicted.

Submitted 26 July, 2024; originally announced July 2024.

arXiv:2407.18471 [pdf, other]

Constructing the CORD-19 Vaccine Dataset

Authors: Manisha Singh, Divy Sharma, Alonso Ma, Bridget Tyree, Margaret Mitchell

Abstract: We introduce new dataset 'CORD-19-Vaccination' to cater to scientists specifically looking into COVID-19 vaccine-related research. This dataset is extracted from CORD-19 dataset [Wang et al., 2020] and augmented with new columns for language detail, author demography, keywords, and topic per paper. Facebook's fastText model is used to identify languages [Joulin et al., 2016]. To establish author demography (author affiliation, lab/institution location, and lab/institution country columns) we processed the JSON file for each paper and then further enhanced using Google's search API to determine country values. 'Yake' was used to extract keywords from the title, abstract, and body of each paper and the LDA (Latent Dirichlet Allocation) algorithm was used to add topic information [Campos et al., 2020, 2018a,b]. To evaluate the dataset, we demonstrate a question-answering task like the one used in the CORD-19 Kaggle challenge [Goldbloom et al., 2022]. For further evaluation, sequential sentence classification was performed on each paper's abstract using the model from Dernoncourt et al. [2016]. We partially hand annotated the training dataset and used a pre-trained BERT-PubMed layer. 'CORD-19-Vaccination' contains 30k research papers and can be immensely valuable for NLP research such as text mining, information extraction, and question answering, specific to the domain of COVID-19 vaccine research.

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.16748 [pdf, other]

doi 10.1051/0004-6361/202347459

The dispersion measure and rotation measure from fast radio burst host galaxies based on the IllustrisTNG50 simulation

Authors: Timea Orsolya Kovacs, Sui Ann Mao, Aritra Basu, Yik Ki Ma, Laura G. Spitler, Charles R. H. Walker

Abstract: Fast radio bursts (FRB) will become important cosmological tools, as the number of observed FRBs is increasing rapidly with more surveys being carried out. A large sample of FRBs with dispersion measures (DM) and rotation measures (RM) can be used to study the intergalactic magnetic field. However, the observed DM and RM of FRBs have multiple contributors which must be quantified to obtain the intergalactic medium's (IGM) DM and RM. In this paper, we estimate one such contribution to DM and RM: that of FRB host galaxies. We show how it changes with redshift, galaxy type, and the stellar mass of the galaxies, inclination, and FRB's projected offset. Using the IllustrisTNG50 simulations, we selected 16500 galaxies at redshifts of 0<=z<=2, with stellar masses in the range 9<=log(M*/Msun)<=12. In each galaxy, we calculate the DM and RM contributions of 1000 sightlines, and construct DM and RM probability density functions. We find that the rest frame DM distributions of all galaxies at a given redshift can be fitted by a lognormal function, and the rest frame RM distribution is symmetric around 0 rad m$^{-2}$, and can be fitted by the combination of a Lorentzian and two Gaussian functions. The parameters of these functions change for different subsets of galaxies with different redshift, stellar mass, inclination, and FRB offset.
These changes are due to an increasing $n_e$ with redshift, SFR, and stellar mass, and we find a more ordered B field at lower z compared to higher z, suggested by more galaxies with B field reversals and B fields dominated by random B field at higher z. We estimate the FRB host DM and RM contributions, which can be used in the future to isolate the IGM's contribution from the observed DM and RM of FRBs. We predict that to constrain an $σ_{\rm RM,IGM}$ of 2 rad m$^{-2}$ to 95% confidence level we need to observe 95000 FRBs at z=0.5, but only 9500 FRBs at z=2.

Submitted 23 July, 2024; originally announced July 2024.

Comments: 24 pages, 15 figures. Accepted for publication in A&A

Journal ref: A&A 690, A47 (2024)

Showing 1–50 of 289 results for author: Maa, A