-
Out-of-Sample Embedding with Proximity Data: Projection versus Restricted Reconstruction
Authors:
Michael W. Trosset,
Kaiyi Tan,
Minh Tang,
Carey E. Priebe
Abstract:
The problem of using proximity (similarity or dissimilarity) data for the purpose of "adding a point to a vector diagram" was first studied by J.C. Gower in 1968. Since then, a number of methods -- mostly kernel methods -- have been proposed for solving what has come to be called the problem of *out-of-sample embedding*. We survey the various kernel methods that we have encountered and show that e…
▽ More
The problem of using proximity (similarity or dissimilarity) data for the purpose of "adding a point to a vector diagram" was first studied by J.C. Gower in 1968. Since then, a number of methods -- mostly kernel methods -- have been proposed for solving what has come to be called the problem of *out-of-sample embedding*. We survey the various kernel methods that we have encountered and show that each can be derived from one or the other of two competing strategies: *projection* or *restricted reconstruction*. Projection can be analogized to a well-known formula for adding a point to a principal component analysis. Restricted reconstruction poses a different challenge: how to best approximate redoing the entire multivariate analysis while holding fixed the vector diagram that was previously obtained. This strategy results in a nonlinear optimization problem that can be simplified to a unidimensional search. Various circumstances may warrant either projection or restricted reconstruction.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
A New Scope and Domain Measure Comparison Method for Global Convergence Analysis in Evolutionary Computation
Authors:
Liu-Yue Luo,
Zhi-Hui Zhan,
Kay Chen Tan,
Jun Zhang
Abstract:
Convergence analysis is a fundamental research topic in evolutionary computation (EC). The commonly used analysis method models the EC algorithm as a homogeneous Markov chain for analysis, which is not always suitable for different EC variants, and also sometimes causes misuse and confusion due to their complex process. In this article, we categorize the existing researches on convergence analysis…
▽ More
Convergence analysis is a fundamental research topic in evolutionary computation (EC). The commonly used analysis method models the EC algorithm as a homogeneous Markov chain for analysis, which is not always suitable for different EC variants, and also sometimes causes misuse and confusion due to their complex process. In this article, we categorize the existing researches on convergence analysis in EC algorithms into stable convergence and global convergence, and then prove that the conditions for these two convergence properties are somehow mutually exclusive. Inspired by this proof, we propose a new scope and domain measure comparison (SDMC) method for analyzing the global convergence of EC algorithms and provide a rigorous proof of its necessity and sufficiency as an alternative condition. Unlike traditional methods, the SDMC method is straightforward, bypasses Markov chain modeling, and minimizes errors from misapplication as it only focuses on the measure of the algorithm's search scope. We apply SDMC to two algorithm types that are unsuitable for traditional methods, confirming its effectiveness in global convergence analysis. Furthermore, we apply the SDMC method to explore the gene targeting mechanism's impact on the global convergence in large-scale global optimization, deriving insights into how to design EC algorithms that guarantee global convergence and exploring how theoretical analysis can guide EC algorithm design.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Actor-Critics Can Achieve Optimal Sample Efficiency
Authors:
Kevin Tan,
Wei Fan,
Yuting Wei
Abstract:
Actor-critic algorithms have become a cornerstone in reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work has successfully learned an $ε$-optimal policy with a sample complexity of $O(1/ε^2)$ trajectories with general function approximation when strategic explorati…
▽ More
Actor-critic algorithms have become a cornerstone in reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work has successfully learned an $ε$-optimal policy with a sample complexity of $O(1/ε^2)$ trajectories with general function approximation when strategic exploration is necessary.
We address this open problem by introducing a novel actor-critic algorithm that attains a sample-complexity of $O(dH^5 \log|\mathcal{A}|/ε^2 + d H^4 \log|\mathcal{F}|/ ε^2)$ trajectories, and accompanying $\sqrt{T}$ regret when the Bellman eluder dimension $d$ does not increase with $T$ at more than a $\log T$ rate.
Here, $\mathcal{F}$ is the critic function class, $\mathcal{A}$ is the action space, and $H$ is the horizon in the finite horizon MDP setting. Our algorithm integrates optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets.
We extend this to the setting of Hybrid RL, showing that initializing the critic with offline data yields sample efficiency gains compared to purely offline or online RL. Further, utilizing access to offline data, we provide a \textit{non-optimistic} provably efficient actor-critic algorithm that only additionally requires $N_{\text{off}} \geq c_{\text{off}}^*dH^4/ε^2$ in exchange for omitting optimism, where $c_{\text{off}}^*$ is the single-policy concentrability coefficient and $N_{\text{off}}$ is the number of offline samples. This addresses another open problem in the literature. We further provide numerical experiments to support our theoretical findings.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
6D Pose Estimation on Spoons and Hands
Authors:
Kevin Tan,
Fan Yang,
Yuhao Chen
Abstract:
Accurate dietary monitoring is essential for promoting healthier eating habits. A key area of research is how people interact and consume food using utensils and hands. By tracking their position and orientation, it is possible to estimate the volume of food being consumed, or monitor eating behaviours, highly useful insights into nutritional intake that can be more reliable than popular methods s…
▽ More
Accurate dietary monitoring is essential for promoting healthier eating habits. A key area of research is how people interact and consume food using utensils and hands. By tracking their position and orientation, it is possible to estimate the volume of food being consumed, or monitor eating behaviours, highly useful insights into nutritional intake that can be more reliable than popular methods such as self-reporting. Hence, this paper implements a system that analyzes stationary video feed of people eating, using 6D pose estimation to track hand and spoon movements to capture spatial position and orientation. In doing so, we examine the performance of two state-of-the-art (SOTA) video object segmentation (VOS) models, both quantitatively and qualitatively, and identify main sources of error within the system.
△ Less
Submitted 4 May, 2025;
originally announced May 2025.
-
RWKV-X: A Linear Complexity Hybrid Language Model
Authors:
Haowen Hou,
Zhiyi Huang,
Kaifeng Tan,
Rongchang Lu,
Fei Richard Yu
Abstract:
In this paper, we introduce RWKV-X, a novel hybrid architecture that combines the efficiency of RWKV for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches that rely on full attention layers and retain quadratic complexity, RWKV-X achieves linear-time complexity in training and constant-time complexity in inference decod…
▽ More
In this paper, we introduce RWKV-X, a novel hybrid architecture that combines the efficiency of RWKV for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches that rely on full attention layers and retain quadratic complexity, RWKV-X achieves linear-time complexity in training and constant-time complexity in inference decoding. We demonstrate that RWKV-X, when continually pretrained on 64K-token sequences, achieves near-perfect accuracy on the 64K passkey retrieval benchmark. It consistently outperforms prior RWKV-7 models on long-context benchmarks, while maintaining strong performance on short-context tasks. These results highlight RWKV-X as a scalable and efficient backbone for general-purpose language modeling, capable of decoding sequences up to 1 million tokens with stable speed and memory usage. To facilitate further research and analysis, we have made the checkpoints and the associated code publicly accessible at: https://github.com/howard-hou/RWKV-X.
△ Less
Submitted 8 May, 2025; v1 submitted 30 April, 2025;
originally announced April 2025.
-
Plug-and-Play Physics-informed Learning using Uncertainty Quantified Port-Hamiltonian Models
Authors:
Kaiyuan Tan,
Peilun Li,
Jun Wang,
Thomas Beckers
Abstract:
The ability to predict trajectories of surrounding agents and obstacles is a crucial component in many robotic applications. Data-driven approaches are commonly adopted for state prediction in scenarios where the underlying dynamics are unknown. However, the performance, reliability, and uncertainty of data-driven predictors become compromised when encountering out-of-distribution observations rel…
▽ More
The ability to predict trajectories of surrounding agents and obstacles is a crucial component in many robotic applications. Data-driven approaches are commonly adopted for state prediction in scenarios where the underlying dynamics are unknown. However, the performance, reliability, and uncertainty of data-driven predictors become compromised when encountering out-of-distribution observations relative to the training data. In this paper, we introduce a Plug-and-Play Physics-Informed Machine Learning (PnP-PIML) framework to address this challenge. Our method employs conformal prediction to identify outlier dynamics and, in that case, switches from a nominal predictor to a physics-consistent model, namely distributed Port-Hamiltonian systems (dPHS). We leverage Gaussian processes to model the energy function of the dPHS, enabling not only the learning of system dynamics but also the quantification of predictive uncertainty through its Bayesian nature. In this way, the proposed framework produces reliable physics-informed predictions even for the out-of-distribution scenarios.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model
Authors:
Zongxian Yang,
Jiayu Qian,
Zhi-An Huang,
Kay Chen Tan
Abstract:
Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning and the sensitive nature of clinical data. Existing LLMs often struggle with intricate medical terminology and the need for accurate clinical insights, leading to performance reduction when quantized for resource-constrained deployment. To address these issue…
▽ More
Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning and the sensitive nature of clinical data. Existing LLMs often struggle with intricate medical terminology and the need for accurate clinical insights, leading to performance reduction when quantized for resource-constrained deployment. To address these issues, we propose Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. QM-ToT leverages a Tree of Thought (ToT) reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This framework facilitates substantial performance improvements in INT4-quantized models on the challenging MedQAUSMLE dataset. Specifically, we demonstrate a remarkable accuracy increase from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b. Besides, we also proposed an effect data distillation method based on ToT. Compared to the traditional distillation method, we achieved an improvement of 86. 27% while using only 3.9% of the data.This work, for the first time, showcases the potential of ToT to significantly enhance performance on complex biomedical tasks, establishing a crucial foundation for future advances in deploying high-performing quantized LLM in resource-limited medical settings.
△ Less
Submitted 13 April, 2025;
originally announced April 2025.
-
M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?
Authors:
Haolong Yan,
Kaijun Tan,
Yeqing Shen,
Xin Huang,
Zheng Ge,
Xiangyu Zhang,
Si Li,
Daxin Jiang
Abstract:
We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in the document? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and difficult to guarantee the coverage of long-range dependencies. To address this issue, we introduce a novel a…
▽ More
We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in the document? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and difficult to guarantee the coverage of long-range dependencies. To address this issue, we introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences. M-DocSum-Bench is a reference-based generation task and necessitates the generation of interleaved image-text summaries using provided reference images, thereby simultaneously evaluating capabilities in understanding, reasoning, localization, and summarization within complex multimodal document scenarios. To facilitate this benchmark, we develop an automated framework to construct summaries and propose a fine-grained evaluation method called M-DocEval. Moreover, we further develop a robust summarization baseline, i.e., M-DocSum-7B, by progressive two-stage training with diverse instruction and preference data. The extensive results on our M-DocSum-Bench reveal that the leading LVLMs struggle to maintain coherence and accurately integrate information within long and interleaved contexts, often exhibiting confusion between similar images and a lack of robustness. Notably, M-DocSum-7B achieves state-of-the-art performance compared to larger and closed-source models (including GPT-4o, Gemini Pro, Claude-3.5-Sonnet and Qwen2.5-VL-72B, etc.), demonstrating the potential of LVLMs for improved interleaved image-text understanding. The code, data, and models are available at https://github.com/stepfun-ai/M-DocSum-Bench.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
A Theoretical Analysis of Analogy-Based Evolutionary Transfer Optimization
Authors:
Xiaoming Xue,
Liang Feng,
Yinglan Feng,
Rui Liu,
Kai Zhang,
Kay Chen Tan
Abstract:
Evolutionary transfer optimization (ETO) has been gaining popularity in research over the years due to its outstanding knowledge transfer ability to address various challenges in optimization. However, a pressing issue in this field is that the invention of new ETO algorithms has far outpaced the development of fundamental theories needed to clearly understand the key factors contributing to the s…
▽ More
Evolutionary transfer optimization (ETO) has been gaining popularity in research over the years due to its outstanding knowledge transfer ability to address various challenges in optimization. However, a pressing issue in this field is that the invention of new ETO algorithms has far outpaced the development of fundamental theories needed to clearly understand the key factors contributing to the success of these algorithms for effective generalization. In response to this challenge, this study aims to establish theoretical foundations for analogy-based ETO, specifically to support various algorithms that frequently reference a key concept known as similarity. First, we introduce analogical reasoning and link its subprocesses to three key issues in ETO. Then, we develop theories for analogy-based knowledge transfer, rooted in the principles that underlie the subprocesses. Afterwards, we present two theorems related to the performance gain of analogy-based knowledge transfer, namely unconditionally nonnegative performance gain and conditionally positive performance gain, to theoretically demonstrate the effectiveness of various analogy-based ETO methods. Last but not least, we offer a novel insight into analogy-based ETO that interprets its conditional superiority over traditional evolutionary optimization through the lens of the no free lunch theorem for optimization.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Authors:
Bo Peng,
Ruichong Zhang,
Daniel Goldstein,
Eric Alcaide,
Xingjian Du,
Haowen Hou,
Jiaju Lin,
Jiaxing Liu,
Janna Lu,
William Merrill,
Guangyu Song,
Kaifeng Tan,
Saiteja Utpala,
Nathan Wilce,
Johan S. Wind,
Tianyi Wu,
Daniel Wuttke,
Christian Zhou-Zheng
Abstract:
We present RWKV-7 "Goose", a new sequence modeling architecture with constant memory usage and constant inference time per token. Despite being trained on dramatically fewer tokens than other top models, our 2.9 billion parameter language model achieves a new 3B SoTA on multilingual tasks and matches the current 3B SoTA on English language downstream performance. RWKV-7 introduces a newly generali…
▽ More
We present RWKV-7 "Goose", a new sequence modeling architecture with constant memory usage and constant inference time per token. Despite being trained on dramatically fewer tokens than other top models, our 2.9 billion parameter language model achieves a new 3B SoTA on multilingual tasks and matches the current 3B SoTA on English language downstream performance. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset.
To foster openness, reproduction, and adoption, we release our models and dataset component listing at https://huggingface.co/RWKV, and our training and inference code at https://github.com/RWKV/RWKV-LM all under the Apache 2.0 License.
△ Less
Submitted 30 March, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
Authors:
Haoyang Huang,
Guoqing Ma,
Nan Duan,
Xing Chen,
Changyi Wan,
Ranchen Ming,
Tianyu Wang,
Bo Wang,
Zhiying Lu,
Aojie Li,
Xianfang Zeng,
Xinhao Zhang,
Gang Yu,
Yuhe Yin,
Qiling Wu,
Wen Sun,
Kang An,
Xin Han,
Deshan Sun,
Wei Ji,
Bizhu Huang,
Brian Li,
Chenfei Wu,
Guanzhe Huang,
Huixin Xiong
, et al. (29 additional authors not shown)
Abstract:
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results de…
▽ More
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Authors:
Guoqing Ma,
Haoyang Huang,
Kun Yan,
Liangyu Chen,
Nan Duan,
Shengming Yin,
Changyi Wan,
Ranchen Ming,
Xiaoniu Song,
Xing Chen,
Yu Zhou,
Deshan Sun,
Deyu Zhou,
Jian Zhou,
Kaijun Tan,
Kang An,
Mei Chen,
Wei Ji,
Qiling Wu,
Wen Sun,
Xin Han,
Yanan Wei,
Zheng Ge,
Aojie Li,
Bin Wang
, et al. (90 additional authors not shown)
Abstract:
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded…
▽ More
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
△ Less
Submitted 24 February, 2025; v1 submitted 14 February, 2025;
originally announced February 2025.
-
Spiking Neural Networks for Temporal Processing: Status Quo and Future Prospects
Authors:
Chenxiang Ma,
Xinyi Chen,
Yanchen Li,
Qu Yang,
Yujie Wu,
Guoqi Li,
Gang Pan,
Huajin Tang,
Kay Chen Tan,
Jibin Wu
Abstract:
Temporal processing is fundamental for both biological and artificial intelligence systems, as it enables the comprehension of dynamic environments and facilitates timely responses. Spiking Neural Networks (SNNs) excel in handling such data with high efficiency, owing to their rich neuronal dynamics and sparse activity patterns. Given the recent surge in the development of SNNs, there is an urgent…
▽ More
Temporal processing is fundamental for both biological and artificial intelligence systems, as it enables the comprehension of dynamic environments and facilitates timely responses. Spiking Neural Networks (SNNs) excel in handling such data with high efficiency, owing to their rich neuronal dynamics and sparse activity patterns. Given the recent surge in the development of SNNs, there is an urgent need for a comprehensive evaluation of their temporal processing capabilities. In this paper, we first conduct an in-depth assessment of commonly used neuromorphic benchmarks, revealing critical limitations in their ability to evaluate the temporal processing capabilities of SNNs. To bridge this gap, we further introduce a benchmark suite consisting of three temporal processing tasks characterized by rich temporal dynamics across multiple timescales. Utilizing this benchmark suite, we perform a thorough evaluation of recently introduced SNN approaches to elucidate the current status of SNNs in temporal processing. Our findings indicate significant advancements in recently developed spiking neuron models and neural architectures regarding their temporal processing capabilities, while also highlighting a performance gap in handling long-range dependencies when compared to state-of-the-art non-spiking models. Finally, we discuss the key challenges and outline potential avenues for future research.
△ Less
Submitted 13 February, 2025;
originally announced February 2025.
-
Regulatory Science Innovation for Generative AI and Large Language Models in Health and Medicine: A Global Call for Action
Authors:
Jasmine Chiat Ling Ong,
Yilin Ning,
Mingxuan Liu,
Yian Ma,
Zhao Liang,
Kuldev Singh,
Robert T Chang,
Silke Vogel,
John CW Lim,
Iris Siu Kwan Tan,
Oscar Freyer,
Stephen Gilbert,
Danielle S Bitterman,
Xiaoxuan Liu,
Alastair K Denniston,
Nan Liu
Abstract:
The integration of generative AI (GenAI) and large language models (LLMs) in healthcare presents both unprecedented opportunities and challenges, necessitating innovative regulatory approaches. GenAI and LLMs offer broad applications, from automating clinical workflows to personalizing diagnostics. However, the non-deterministic outputs, broad functionalities and complex integration of GenAI and L…
▽ More
The integration of generative AI (GenAI) and large language models (LLMs) in healthcare presents both unprecedented opportunities and challenges, necessitating innovative regulatory approaches. GenAI and LLMs offer broad applications, from automating clinical workflows to personalizing diagnostics. However, the non-deterministic outputs, broad functionalities and complex integration of GenAI and LLMs challenge existing medical device regulatory frameworks, including the total product life cycle (TPLC) approach. Here we discuss the constraints of the TPLC approach to GenAI and LLM-based medical device regulation, and advocate for global collaboration in regulatory science research. This serves as the foundation for developing innovative approaches including adaptive policies and regulatory sandboxes, to test and refine governance in real-world settings. International harmonization, as seen with the International Medical Device Regulators Forum, is essential to manage implications of LLM on global health, including risks of widening health inequities driven by inherent model biases. By engaging multidisciplinary expertise, prioritizing iterative, data-driven approaches, and focusing on the needs of diverse populations, global regulatory science research enables the responsible and equitable advancement of LLM innovations in healthcare.
△ Less
Submitted 27 January, 2025;
originally announced February 2025.
-
scBIT: Integrating Single-cell Transcriptomic Data into fMRI-based Prediction for Alzheimer's Disease Diagnosis
Authors:
Yu-An Huang,
Yao Hu,
Yue-Chao Li,
Xiyue Cao,
Xinyuan Li,
Kay Chen Tan,
Zhu-Hong You,
Zhi-An Huang
Abstract:
Functional MRI (fMRI) and single-cell transcriptomics are pivotal in Alzheimer's disease (AD) research, each providing unique insights into neural function and molecular mechanisms. However, integrating these complementary modalities remains largely unexplored. Here, we introduce scBIT, a novel method for enhancing AD prediction by combining fMRI with single-nucleus RNA (snRNA). scBIT leverages sn…
▽ More
Functional MRI (fMRI) and single-cell transcriptomics are pivotal in Alzheimer's disease (AD) research, each providing unique insights into neural function and molecular mechanisms. However, integrating these complementary modalities remains largely unexplored. Here, we introduce scBIT, a novel method for enhancing AD prediction by combining fMRI with single-nucleus RNA (snRNA). scBIT leverages snRNA as an auxiliary modality, significantly improving fMRI-based prediction models and providing comprehensive interpretability. It employs a sampling strategy to segment snRNA data into cell-type-specific gene networks and utilizes a self-explainable graph neural network to extract critical subgraphs. Additionally, we use demographic and genetic similarities to pair snRNA and fMRI data across individuals, enabling robust cross-modal learning. Extensive experiments validate scBIT's effectiveness in revealing intricate brain region-gene associations and enhancing diagnostic prediction accuracy. By advancing brain imaging transcriptomics to the single-cell level, scBIT sheds new light on biomarker discovery in AD research. Experimental results show that incorporating snRNA data into the scBIT model significantly boosts accuracy, improving binary classification by 3.39% and five-class classification by 26.59%. The codes were implemented in Python and have been released on GitHub (https://github.com/77YQ77/scBIT) and Zenodo (https://zenodo.org/records/11599030) with detailed instructions.
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
Fast Direct: Query-Efficient Online Black-box Guidance for Diffusion-model Target Generation
Authors:
Kim Yong Tan,
Yueming Lyu,
Ivor Tsang,
Yew-Soon Ong
Abstract:
Guided diffusion-model generation is a promising direction for customizing the generation process of a pre-trained diffusion model to address specific downstream tasks. Existing guided diffusion models either rely on training the guidance model with pre-collected datasets or require the objective functions to be differentiable. However, for most real-world tasks, offline datasets are often unavail…
▽ More
Guided diffusion-model generation is a promising direction for customizing the generation process of a pre-trained diffusion model to address specific downstream tasks. Existing guided diffusion models either rely on training the guidance model with pre-collected datasets or require the objective functions to be differentiable. However, for most real-world tasks, offline datasets are often unavailable, and their objective functions are often not differentiable, such as image generation with human preferences, molecular generation for drug discovery, and material design. Thus, we need an $\textbf{online}$ algorithm capable of collecting data during runtime and supporting a $\textbf{black-box}$ objective function. Moreover, the $\textbf{query efficiency}$ of the algorithm is also critical because the objective evaluation of the query is often expensive in real-world scenarios. In this work, we propose a novel and simple algorithm, $\textbf{Fast Direct}$, for query-efficient online black-box target generation. Our Fast Direct builds a pseudo-target on the data manifold to update the noise sequence of the diffusion model with a universal direction, which is promising to perform query-efficient guided generation. Extensive experiments on twelve high-resolution ($\small {1024 \times 1024}$) image target generation tasks and six 3D-molecule target generation tasks show $\textbf{6}\times$ up to $\textbf{10}\times$ query efficiency improvement and $\textbf{11}\times$ up to $\textbf{44}\times$ query efficiency improvement, respectively. Our implementation is publicly available at: https://github.com/kimyong95/guide-stable-diffusion/tree/fast-direct
△ Less
Submitted 29 March, 2025; v1 submitted 2 February, 2025;
originally announced February 2025.
-
Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment
Authors:
Joanna Hong,
Sanjeel Parekh,
Honglie Chen,
Jacob Donley,
Ke Tan,
Buye Xu,
Anurag Kumar
Abstract:
Building reliable speech systems often requires combining multiple modalities, like audio and visual cues. While such multimodal solutions frequently lead to improvements in performance and may even be critical in certain cases, they come with several constraints such as increased sensory requirements, computational cost, and modality synchronization, to mention a few. These challenges constrain t…
▽ More
Building reliable speech systems often requires combining multiple modalities, like audio and visual cues. While such multimodal solutions frequently lead to improvements in performance and may even be critical in certain cases, they come with several constraints such as increased sensory requirements, computational cost, and modality synchronization, to mention a few. These challenges constrain the direct uses of these multimodal solutions in real-world applications. In this work, we develop approaches where the learning happens with all available modalities but the deployment or inference is done with just one or reduced modalities. To do so, we propose a Multimodal Training and Unimodal Deployment (MUTUD) framework which includes a Temporally Aligned Modality feature Estimation (TAME) module that can estimate information from missing modality using modalities present during inference. This innovative approach facilitates the integration of information across different modalities, enhancing the overall inference process by leveraging the strengths of each modality to compensate for the absence of certain modalities during inference. We apply MUTUD to various audiovisual speech tasks and show that it can reduce the performance gap between the multimodal and corresponding unimodal models to a considerable extent. MUTUD can achieve this while reducing the model size and compute compared to multimodal models, in some cases by almost 80%.
△ Less
Submitted 30 January, 2025;
originally announced January 2025.
-
EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning
Authors:
Bowen Zheng,
Ran Cheng,
Kay Chen Tan
Abstract:
Evolutionary Reinforcement Learning (EvoRL) has emerged as a promising approach to overcoming the limitations of traditional reinforcement learning (RL) by integrating the Evolutionary Computation (EC) paradigm with RL. However, the population-based nature of EC significantly increases computational costs, thereby restricting the exploration of algorithmic design choices and scalability in large-s…
▽ More
Evolutionary Reinforcement Learning (EvoRL) has emerged as a promising approach to overcoming the limitations of traditional reinforcement learning (RL) by integrating the Evolutionary Computation (EC) paradigm with RL. However, the population-based nature of EC significantly increases computational costs, thereby restricting the exploration of algorithmic design choices and scalability in large-scale settings. To address this challenge, we introduce $\texttt{$\textbf{EvoRL}$}$, the first end-to-end EvoRL framework optimized for GPU acceleration. The framework executes the entire training pipeline on accelerators, including environment simulations and EC processes, leveraging hierarchical parallelism through vectorization and compilation techniques to achieve superior speed and scalability. This design enables the efficient training of large populations on a single machine. In addition to its performance-oriented design, $\texttt{$\textbf{EvoRL}$}$ offers a comprehensive platform for EvoRL research, encompassing implementations of traditional RL algorithms (e.g., A2C, PPO, DDPG, TD3, SAC), Evolutionary Algorithms (e.g., CMA-ES, OpenES, ARS), and hybrid EvoRL paradigms such as Evolutionary-guided RL (e.g., ERL, CEM-RL) and Population-Based AutoRL (e.g., PBT). The framework's modular architecture and user-friendly interface allow researchers to seamlessly integrate new components, customize algorithms, and conduct fair benchmarking and ablation studies. The project is open-source and available at: https://github.com/EMI-Group/evorl.
△ Less
Submitted 2 February, 2025; v1 submitted 25 January, 2025;
originally announced January 2025.
-
I Can Find You in Seconds! Leveraging Large Language Models for Code Authorship Attribution
Authors:
Soohyeon Choi,
Yong Kiam Tan,
Mark Huasong Meng,
Mohamed Ragab,
Soumik Mondal,
David Mohaisen,
Khin Mi Mi Aung
Abstract:
Source code authorship attribution is important in software forensics, plagiarism detection, and protecting software patch integrity. Existing techniques often rely on supervised machine learning, which struggles with generalization across different programming languages and coding styles due to the need for large labeled datasets. Inspired by recent advances in natural language authorship analysi…
▽ More
Source code authorship attribution is important in software forensics, plagiarism detection, and protecting software patch integrity. Existing techniques often rely on supervised machine learning, which struggles with generalization across different programming languages and coding styles due to the need for large labeled datasets. Inspired by recent advances in natural language authorship analysis using large language models (LLMs), which have shown exceptional performance without task-specific tuning, this paper explores the use of LLMs for source code authorship attribution.
We present a comprehensive study demonstrating that state-of-the-art LLMs can successfully attribute source code authorship across different languages. LLMs can determine whether two code snippets are written by the same author with zero-shot prompting, achieving a Matthews Correlation Coefficient (MCC) of 0.78, and can attribute code authorship from a small set of reference code snippets via few-shot learning, achieving MCC of 0.77. Additionally, LLMs show some adversarial robustness against misattribution attacks.
Despite these capabilities, we found that naive prompting of LLMs does not scale well with a large number of authors due to input token limitations. To address this, we propose a tournament-style approach for large-scale attribution. Evaluating this approach on datasets of C++ (500 authors, 26,355 samples) and Java (686 authors, 55,267 samples) code from GitHub, we achieve classification accuracy of up to 65% for C++ and 68.7% for Java using only one reference per author. These results open new possibilities for applying LLMs to code authorship attribution in cybersecurity and software engineering.
△ Less
Submitted 14 January, 2025;
originally announced January 2025.
-
ParetoLens: A Visual Analytics Framework for Exploring Solution Sets of Multi-objective Evolutionary Algorithms
Authors:
Yuxin Ma,
Zherui Zhang,
Ran Cheng,
Yaochu Jin,
Kay Chen Tan
Abstract:
In the domain of multi-objective optimization, evolutionary algorithms are distinguished by their capability to generate a diverse population of solutions that navigate the trade-offs inherent among competing objectives. This has catalyzed the ascension of evolutionary multi-objective optimization (EMO) as a prevalent approach. Despite the effectiveness of the EMO paradigm, the analysis of resulta…
▽ More
In the domain of multi-objective optimization, evolutionary algorithms are distinguished by their capability to generate a diverse population of solutions that navigate the trade-offs inherent among competing objectives. This has catalyzed the ascension of evolutionary multi-objective optimization (EMO) as a prevalent approach. Despite the effectiveness of the EMO paradigm, the analysis of resultant solution sets presents considerable challenges. This is primarily attributed to the high-dimensional nature of the data and the constraints imposed by static visualization methods, which frequently culminate in visual clutter and impede interactive exploratory analysis. To address these challenges, this paper introduces ParetoLens, a visual analytics framework specifically tailored to enhance the inspection and exploration of solution sets derived from the multi-objective evolutionary algorithms. Utilizing a modularized, algorithm-agnostic design, ParetoLens enables a detailed inspection of solution distributions in both decision and objective spaces through a suite of interactive visual representations. This approach not only mitigates the issues associated with static visualizations but also supports a more nuanced and flexible analysis process. The usability of the framework is evaluated through case studies and expert interviews, demonstrating its potential to uncover complex patterns and facilitate a deeper understanding of multi-objective optimization solution sets. A demo website of ParetoLens is available at https://dva-lab.org/paretolens/.
△ Less
Submitted 6 January, 2025;
originally announced January 2025.
-
Mining Platoon Patterns from Traffic Videos
Authors:
Yijun Bei,
Teng Ma,
Dongxiang Zhang,
Sai Wu,
Kian-Lee Tan,
Gang Chen
Abstract:
Discovering co-movement patterns from urban-scale video data sources has emerged as an attractive topic. This task aims to identify groups of objects that travel together along a common route, which offers effective support for government agencies in enhancing smart city management. However, the previous work has made a strong assumption on the accuracy of recovered trajectories from videos and th…
▽ More
Discovering co-movement patterns from urban-scale video data sources has emerged as an attractive topic. This task aims to identify groups of objects that travel together along a common route, which offers effective support for government agencies in enhancing smart city management. However, the previous work has made a strong assumption on the accuracy of recovered trajectories from videos and their co-movement pattern definition requires the group of objects to appear across consecutive cameras along the common route. In practice, this often leads to missing patterns if a vehicle is not correctly identified from a certain camera due to object occlusion or vehicle mis-matching. To address this challenge, we propose a relaxed definition of co-movement patterns from video data, which removes the consecutiveness requirement in the common route and accommodates a certain number of missing captured cameras for objects within the group. Moreover, a novel enumeration framework called MaxGrowth is developed to efficiently retrieve the relaxed patterns. Unlike previous filter-and-refine frameworks comprising both candidate enumeration and subsequent candidate verification procedures, MaxGrowth incurs no verification cost for the candidate patterns. It treats the co-movement pattern as an equivalent sequence of clusters, enumerating candidates with increasing sequence length while avoiding the generation of any false positives. Additionally, we also propose two effective pruning rules to efficiently filter the non-maximal patterns. Extensive experiments are conducted to validate the efficiency of MaxGrowth and the quality of its generated co-movement patterns. Our MaxGrowth runs up to two orders of magnitude faster than the baseline algorithm. It also demonstrates high accuracy in real video dataset when the trajectory recovery algorithm is not perfect.
△ Less
Submitted 1 January, 2025; v1 submitted 28 December, 2024;
originally announced December 2024.
-
MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond
Authors:
Muhammad Huzaifah,
Geyu Lin,
Tianchi Liu,
Hardik B. Sailor,
Kye Min Tan,
Tarun K. Vangani,
Qiongqiong Wang,
Jeremy H. M. Wong,
Nancy F. Chen,
Ai Ti Aw
Abstract:
This technical report describes the MERaLiON-SpeechEncoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore's National Multimodal Large Language Model Programme, the MERaLiON-SpeechEncoder is tailored to address the speech processing needs in Singapore and the surrounding Southeast Asian region. The model currently supports main…
▽ More
This technical report describes the MERaLiON-SpeechEncoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore's National Multimodal Large Language Model Programme, the MERaLiON-SpeechEncoder is tailored to address the speech processing needs in Singapore and the surrounding Southeast Asian region. The model currently supports mainly English, including the variety spoken in Singapore. We are actively expanding our datasets to gradually cover other languages in subsequent releases. The MERaLiON-SpeechEncoder was pre-trained from scratch on 200,000 hours of unlabelled speech data using a self-supervised learning approach based on masked language modelling. We describe our training procedure and hyperparameter tuning experiments in detail below. Our evaluation demonstrates improvements to spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive to other state-of-the-art speech encoders across ten other speech tasks. We commit to releasing our model, supporting broader research endeavours, both in Singapore and beyond.
△ Less
Submitted 20 December, 2024; v1 submitted 16 December, 2024;
originally announced December 2024.
-
PhishIntel: Toward Practical Deployment of Reference-Based Phishing Detection
Authors:
Yuexin Li,
Hiok Kuek Tan,
Qiaoran Meng,
Mei Lin Lock,
Tri Cao,
Shumin Deng,
Nay Oo,
Hoon Wei Lim,
Bryan Hooi
Abstract:
Phishing is a critical cyber threat, exploiting deceptive tactics to compromise victims and cause significant financial losses. While reference-based phishing detectors (RBPDs) have achieved notable advancements in detection accuracy, their real-world deployment is hindered by challenges such as high latency and inefficiency in URL analysis. To address these limitations, we present PhishIntel, an…
▽ More
Phishing is a critical cyber threat, exploiting deceptive tactics to compromise victims and cause significant financial losses. While reference-based phishing detectors (RBPDs) have achieved notable advancements in detection accuracy, their real-world deployment is hindered by challenges such as high latency and inefficiency in URL analysis. To address these limitations, we present PhishIntel, an end-to-end phishing detection system for real-world deployment. PhishIntel intelligently determines whether a URL can be processed immediately or not, segmenting the detection process into two distinct tasks: a fast task that checks against local blacklists and result cache, and a slow task that conducts online blacklist verification, URL crawling, and webpage analysis using an RBPD. This fast-slow task system architecture ensures low response latency while retaining the robust detection capabilities of RBPDs for zero-day phishing threats. Furthermore, we develop two downstream applications based on PhishIntel: a phishing intelligence platform and a phishing email detection plugin for Microsoft Outlook, demonstrating its practical efficacy and utility.
△ Less
Submitted 14 February, 2025; v1 submitted 12 December, 2024;
originally announced December 2024.
-
Interpretable Hierarchical Attention Network for Medical Condition Identification
Authors:
Dongping Fang,
Lian Duan,
Xiaojing Yuan,
Allyn Klunder,
Kevin Tan,
Suiting Cao,
Yeqing Ji,
Mike Xu
Abstract:
Accurate prediction of medical conditions with straight past clinical evidence is a long-sought topic in the medical management and health insurance field. Although great progress has been made with machine learning algorithms, the medical community is still skeptical about the model accuracy and interpretability. This paper presents an innovative hierarchical attention deep learning model to achi…
▽ More
Accurate prediction of medical conditions with straight past clinical evidence is a long-sought topic in the medical management and health insurance field. Although great progress has been made with machine learning algorithms, the medical community is still skeptical about the model accuracy and interpretability. This paper presents an innovative hierarchical attention deep learning model to achieve better prediction and clear interpretability that can be easily understood by medical professionals.
This paper developed an Interpretable Hierarchical Attention Network (IHAN). IHAN uses a hierarchical attention structure that matches naturally with the medical history data structure and reflects patients encounter (date of service) sequence. The model attention structure consists of 3 levels: (1) attention on the medical code types (diagnosis codes, procedure codes, lab test results, and prescription drugs), (2) attention on the sequential medical encounters within a type, (3) attention on the individual medical codes within an encounter and type.
This model is applied to predict the occurrence of stage 3 chronic kidney disease (CKD), using three years medical history of Medicare Advantage (MA) members from an American nationwide health insurance company. The model takes members medical events, both claims and Electronic Medical Records (EMR) data, as input, makes a prediction of stage 3 CKD and calculates contribution from individual events to the predicted outcome.
△ Less
Submitted 4 December, 2024;
originally announced December 2024.
-
Intent-Aware Dialogue Generation and Multi-Task Contrastive Learning for Multi-Turn Intent Classification
Authors:
Junhua Liu,
Yong Keat Tan,
Bin Fu,
Kwan Hui Lim
Abstract:
Generating large-scale, domain-specific, multilingual multi-turn dialogue datasets remains a significant hurdle for training effective Multi-Turn Intent Classification models in chatbot systems. In this paper, we introduce Chain-of-Intent, a novel mechanism that combines Hidden Markov Models with Large Language Models (LLMs) to generate contextually aware, intent-driven conversations through self-…
▽ More
Generating large-scale, domain-specific, multilingual multi-turn dialogue datasets remains a significant hurdle for training effective Multi-Turn Intent Classification models in chatbot systems. In this paper, we introduce Chain-of-Intent, a novel mechanism that combines Hidden Markov Models with Large Language Models (LLMs) to generate contextually aware, intent-driven conversations through self-play. By extracting domain-specific knowledge from e-commerce chat logs, we estimate conversation turns and intent transitions, which guide the generation of coherent dialogues. Leveraging LLMs to enhance emission probabilities, our approach produces natural and contextually consistent questions and answers. We also propose MINT-CL, a framework for multi-turn intent classification using multi-task contrastive learning, improving classification accuracy without the need for extensive annotated data. Evaluations show that our methods outperform baselines in dialogue quality and intent classification accuracy, especially in multilingual settings, while significantly reducing data generation efforts. Furthermore, we release MINT-E, a multilingual, intent-aware multi-turn e-commerce dialogue corpus to support future research in this area.
△ Less
Submitted 21 November, 2024;
originally announced November 2024.
-
Balancing Accuracy and Efficiency in Multi-Turn Intent Classification for LLM-Powered Dialog Systems in Production
Authors:
Junhua Liu,
Yong Keat Tan,
Bin Fu,
Kwan Hui Lim
Abstract:
Accurate multi-turn intent classification is essential for advancing conversational AI systems. However, challenges such as the scarcity of comprehensive datasets and the complexity of contextual dependencies across dialogue turns hinder progress. This paper presents two novel approaches leveraging Large Language Models (LLMs) to enhance scalability and reduce latency in production dialogue system…
▽ More
Accurate multi-turn intent classification is essential for advancing conversational AI systems. However, challenges such as the scarcity of comprehensive datasets and the complexity of contextual dependencies across dialogue turns hinder progress. This paper presents two novel approaches leveraging Large Language Models (LLMs) to enhance scalability and reduce latency in production dialogue systems. First, we introduce Symbol Tuning, which simplifies intent labels to reduce task complexity and improve performance in multi-turn dialogues. Second, we propose C-LARA (Consistency-aware, Linguistics Adaptive Retrieval Augmentation), a framework that employs LLMs for data augmentation and pseudo-labeling to generate synthetic multi-turn dialogues. These enriched datasets are used to fine-tune a small, efficient model suitable for deployment. Experiments conducted on multilingual dialogue datasets demonstrate significant improvements in classification accuracy and resource efficiency. Our methods enhance multi-turn intent classification accuracy by 5.09%, reduce annotation costs by 40%, and enable scalable deployment in low-resource multilingual industrial systems, highlighting their practicality and impact.
△ Less
Submitted 19 November, 2024;
originally announced November 2024.
-
MBL-CPDP: A Multi-objective Bilevel Method for Cross-Project Defect Prediction via Automated Machine Learning
Authors:
Jiaxin Chen,
Jinliang Ding,
Kay Chen Tan,
Jiancheng Qian,
Ke Li
Abstract:
Cross-project defect prediction (CPDP) leverages machine learning (ML) techniques to proactively identify software defects, especially where project-specific data is scarce. However, developing a robust ML pipeline with optimal hyperparameters that effectively use cross-project information and yield satisfactory performance remains challenging. In this paper, we resolve this bottleneck by formulat…
▽ More
Cross-project defect prediction (CPDP) leverages machine learning (ML) techniques to proactively identify software defects, especially where project-specific data is scarce. However, developing a robust ML pipeline with optimal hyperparameters that effectively use cross-project information and yield satisfactory performance remains challenging. In this paper, we resolve this bottleneck by formulating CPDP as a multi-objective bilevel optimization (MBLO) method, dubbed MBL-CPDP. It comprises two nested problems: the upper-level, a multi-objective combinatorial optimization problem, enhances robustness and efficiency in optimizing ML pipelines, while the lower-level problem is an expensive optimization problem that focuses on tuning their optimal hyperparameters. Due to the high-dimensional search space characterized by feature redundancy and inconsistent data distributions, the upper-level problem combines feature selection, transfer learning, and classification to leverage limited and heterogeneous historical data. Meanwhile, an ensemble learning method is proposed to capture differences in cross-project distribution and generalize across diverse datasets. Finally, a MBLO algorithm is presented to solve this problem while achieving high adaptability effectively. To evaluate the performance of MBL-CPDP, we compare it with five automated ML tools and $50$ CPDP techniques across $20$ projects. Extensive empirical results show that MBL-CPDPoutperforms the comparison methods, demonstrating its superior adaptability and comprehensive performance evaluation capability.
△ Less
Submitted 10 November, 2024;
originally announced November 2024.
-
Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models
Authors:
Aliyah R. Hsu,
James Zhu,
Zhichao Wang,
Bin Bi,
Shubham Mehrotra,
Shiva K. Pentyala,
Katherine Tan,
Xiang-Bo Mao,
Roshanak Omrani,
Sougata Chaudhuri,
Regunathan Radhakrishnan,
Sitaram Asur,
Claire Na Cheng,
Bin Yu
Abstract:
LLMs have demonstrated impressive proficiency in generating coherent and high-quality text, making them valuable across a range of text-generation tasks. However, rigorous evaluation of this generated content is crucial, as ensuring its quality remains a significant challenge due to persistent issues such as factual inaccuracies and hallucination. This paper introduces three fine-tuned general-pur…
▽ More
LLMs have demonstrated impressive proficiency in generating coherent and high-quality text, making them valuable across a range of text-generation tasks. However, rigorous evaluation of this generated content is crucial, as ensuring its quality remains a significant challenge due to persistent issues such as factual inaccuracies and hallucination. This paper introduces three fine-tuned general-purpose LLM autoevaluators, REC-8B, REC-12B and REC-70B, specifically designed to evaluate generated text across several dimensions: faithfulness, instruction following, coherence, and completeness. These models not only provide ratings for these metrics but also offer detailed explanation and verifiable citation, thereby enhancing trust in the content. Moreover, the models support various citation modes, accommodating different requirements for latency and granularity. Extensive evaluations on diverse benchmarks demonstrate that our general-purpose LLM auto-evaluator, REC-70B, outperforms state-of-the-art LLMs, excelling in content evaluation by delivering better quality explanation and citation with minimal bias. It achieves Rank #1 as of Feb 15th, 2025 as a generative model on the RewardBench leaderboard under the model name TextEval-Llama3.1-70B. Our REC dataset and models are available at https://github.com/adelaidehsu/REC.
△ Less
Submitted 18 February, 2025; v1 submitted 2 November, 2024;
originally announced November 2024.
-
Toward Automated Algorithm Design: A Survey and Practical Guide to Meta-Black-Box-Optimization
Authors:
Zeyuan Ma,
Hongshu Guo,
Yue-Jiao Gong,
Jun Zhang,
Kay Chen Tan
Abstract:
In this survey, we introduce Meta-Black-Box-Optimization~(MetaBBO) as an emerging avenue within the Evolutionary Computation~(EC) community, which incorporates Meta-learning approaches to assist automated algorithm design. Despite the success of MetaBBO, the current literature provides insufficient summaries of its key aspects and lacks practical guidance for implementation. To bridge this gap, we…
▽ More
In this survey, we introduce Meta-Black-Box-Optimization~(MetaBBO) as an emerging avenue within the Evolutionary Computation~(EC) community, which incorporates Meta-learning approaches to assist automated algorithm design. Despite the success of MetaBBO, the current literature provides insufficient summaries of its key aspects and lacks practical guidance for implementation. To bridge this gap, we offer a comprehensive review of recent advances in MetaBBO, providing an in-depth examination of its key developments. We begin with a unified definition of the MetaBBO paradigm, followed by a systematic taxonomy of various algorithm design tasks, including algorithm selection, algorithm configuration, solution manipulation, and algorithm generation. Further, we conceptually summarize different learning methodologies behind current MetaBBO works, including reinforcement learning, supervised learning, neuroevolution, and in-context learning with Large Language Models. A comprehensive evaluation of the latest representative MetaBBO methods is then carried out, alongside an experimental analysis of their optimization performance, computational efficiency, and generalization ability. Based on the evaluation results, we meticulously identify a set of core designs that enhance the generalization and learning effectiveness of MetaBBO. Finally, we outline the vision for the field by providing insight into the latest trends and potential future directions. Relevant literature will be continuously collected and updated at https://github.com/MetaEvo/Awesome-MetaBBO.
△ Less
Submitted 30 April, 2025; v1 submitted 1 November, 2024;
originally announced November 2024.
-
A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model
Authors:
Shengxiang Gao,
Fang nan,
Yongbing Zhang,
Yuxin Huang,
Kaiwen Tan,
Zhengtao Yu
Abstract:
Existing research on news summarization primarily focuses on single-language single-document (SLSD), single-language multi-document (SLMD) or cross-language single-document (CLSD). However, in real-world scenarios, news about a international event often involves multiple documents in different languages, i.e., mixed-language multi-document (MLMD). Therefore, summarizing MLMD news is of great signi…
▽ More
Existing research on news summarization primarily focuses on single-language single-document (SLSD), single-language multi-document (SLMD) or cross-language single-document (CLSD). However, in real-world scenarios, news about a international event often involves multiple documents in different languages, i.e., mixed-language multi-document (MLMD). Therefore, summarizing MLMD news is of great significance. However, the lack of datasets for MLMD news summarization has constrained the development of research in this area. To fill this gap, we construct a mixed-language multi-document news summarization dataset (MLMD-news), which contains four different languages and 10,992 source document cluster and target summary pairs. Additionally, we propose a graph-based extract-generate model and benchmark various methods on the MLMD-news dataset and publicly release our dataset and code\footnote[1]{https://github.com/Southnf9/MLMD-news}, aiming to advance research in summarization within MLMD scenarios.
△ Less
Submitted 13 October, 2024;
originally announced October 2024.
-
Towards Ultra-Low-Power Neuromorphic Speech Enhancement with Spiking-FullSubNet
Authors:
Xiang Hao,
Chenxiang Ma,
Qu Yang,
Jibin Wu,
Kay Chen Tan
Abstract:
Speech enhancement is critical for improving speech intelligibility and quality in various audio devices. In recent years, deep learning-based methods have significantly improved speech enhancement performance, but they often come with a high computational cost, which is prohibitive for a large number of edge devices, such as headsets and hearing aids. This work proposes an ultra-low-power speech…
▽ More
Speech enhancement is critical for improving speech intelligibility and quality in various audio devices. In recent years, deep learning-based methods have significantly improved speech enhancement performance, but they often come with a high computational cost, which is prohibitive for a large number of edge devices, such as headsets and hearing aids. This work proposes an ultra-low-power speech enhancement system based on the brain-inspired spiking neural network (SNN) called Spiking-FullSubNet. Spiking-FullSubNet follows a full-band and sub-band fusioned approach to effectively capture both global and local spectral information. To enhance the efficiency of computationally expensive sub-band modeling, we introduce a frequency partitioning method inspired by the sensitivity profile of the human peripheral auditory system. Furthermore, we introduce a novel spiking neuron model that can dynamically control the input information integration and forgetting, enhancing the multi-scale temporal processing capability of SNN, which is critical for speech denoising. Experiments conducted on the recent Intel Neuromorphic Deep Noise Suppression (N-DNS) Challenge dataset show that the Spiking-FullSubNet surpasses state-of-the-art methods by large margins in terms of both speech quality and energy efficiency metrics. Notably, our system won the championship of the Intel N-DNS Challenge (Algorithmic Track), opening up a myriad of opportunities for ultra-low-power speech enhancement at the edge. Our source code and model checkpoints are publicly available at https://github.com/haoxiangsnr/spiking-fullsubnet.
△ Less
Submitted 7 October, 2024;
originally announced October 2024.
-
Estimating Generalization Performance Along the Trajectory of Proximal SGD in Robust Regression
Authors:
Kai Tan,
Pierre C. Bellec
Abstract:
This paper studies the generalization performance of iterates obtained by Gradient Descent (GD), Stochastic Gradient Descent (SGD) and their proximal variants in high-dimensional robust regression problems. The number of features is comparable to the sample size and errors may be heavy-tailed. We introduce estimators that precisely track the generalization error of the iterates along the trajector…
▽ More
This paper studies the generalization performance of iterates obtained by Gradient Descent (GD), Stochastic Gradient Descent (SGD) and their proximal variants in high-dimensional robust regression problems. The number of features is comparable to the sample size and errors may be heavy-tailed. We introduce estimators that precisely track the generalization error of the iterates along the trajectory of the iterative algorithm. These estimators are provably consistent under suitable conditions. The results are illustrated through several examples, including Huber regression, pseudo-Huber regression, and their penalized variants with non-smooth regularizer. We provide explicit generalization error estimates for iterates generated from GD and SGD, or from proximal SGD in the presence of a non-smooth regularizer. The proposed risk estimates serve as effective proxies for the actual generalization error, allowing us to determine the optimal stopping iteration that minimizes the generalization error. Extensive simulations confirm the effectiveness of the proposed generalization error estimates.
△ Less
Submitted 3 November, 2024; v1 submitted 3 October, 2024;
originally announced October 2024.
-
Calibrate to Discriminate: Improve In-Context Learning with Label-Free Comparative Inference
Authors:
Wei Cheng,
Tianlu Wang,
Yanmin Ji,
Fan Yang,
Keren Tan,
Yiyu Zheng
Abstract:
While in-context learning with large language models (LLMs) has shown impressive performance, we have discovered a unique miscalibration behavior where both correct and incorrect predictions are assigned the same level of confidence. We refer to this phenomenon as indiscriminate miscalibration. We found that traditional calibration metrics, such as Expected Calibrated Errors (ECEs), are unable to…
▽ More
While in-context learning with large language models (LLMs) has shown impressive performance, we have discovered a unique miscalibration behavior where both correct and incorrect predictions are assigned the same level of confidence. We refer to this phenomenon as indiscriminate miscalibration. We found that traditional calibration metrics, such as Expected Calibrated Errors (ECEs), are unable to capture this behavior effectively. To address this issue, we propose new metrics to measure the severity of indiscriminate miscalibration. Additionally, we develop a novel in-context comparative inference method to alleviate miscalibrations and improve classification performance. Through extensive experiments on five datasets, we demonstrate that our proposed method can achieve more accurate and calibrated predictions compared to regular zero-shot and few-shot prompting.
△ Less
Submitted 3 October, 2024;
originally announced October 2024.
-
HM3: Hierarchical Multi-Objective Model Merging for Pretrained Models
Authors:
Yu Zhou,
Xingyu Wu,
Jibin Wu,
Liang Feng,
Kay Chen Tan
Abstract:
Model merging is a technique that combines multiple large pretrained models into a single model with enhanced performance and broader task adaptability. It has gained popularity in large pretrained model development due to its ability to bypass the need for original training data and further training processes. However, most existing model merging approaches focus solely on exploring the parameter…
▽ More
Model merging is a technique that combines multiple large pretrained models into a single model with enhanced performance and broader task adaptability. It has gained popularity in large pretrained model development due to its ability to bypass the need for original training data and further training processes. However, most existing model merging approaches focus solely on exploring the parameter space, merging models with identical architectures. Merging within the architecture space, despite its potential, remains in its early stages due to the vast search space and the challenges of layer compatibility. This paper marks a significant advance toward more flexible and comprehensive model merging techniques by modeling the architecture-space merging process as a reinforcement learning task. We train policy and value networks using offline sampling of weight vectors, which are then employed for the online optimization of merging strategies. Moreover, a multi-objective optimization paradigm is introduced to accommodate users' diverse task preferences, learning the Pareto front of optimal models to offer customized merging suggestions. Experimental results across multiple tasks, including text translation, mathematical reasoning, and code generation, validate the effectiveness and superiority of the proposed framework in model merging. The code will be made publicly available after the review process.
△ Less
Submitted 27 September, 2024;
originally announced September 2024.
-
Unsupervised Fingerphoto Presentation Attack Detection With Diffusion Models
Authors:
Hailin Li,
Raghavendra Ramachandra,
Mohamed Ragab,
Soumik Mondal,
Yong Kiam Tan,
Khin Mi Mi Aung
Abstract:
Smartphone-based contactless fingerphoto authentication has become a reliable alternative to traditional contact-based fingerprint biometric systems owing to rapid advances in smartphone camera technology. Despite its convenience, fingerprint authentication through fingerphotos is more vulnerable to presentation attacks, which has motivated recent research efforts towards developing fingerphoto Pr…
▽ More
Smartphone-based contactless fingerphoto authentication has become a reliable alternative to traditional contact-based fingerprint biometric systems owing to rapid advances in smartphone camera technology. Despite its convenience, fingerprint authentication through fingerphotos is more vulnerable to presentation attacks, which has motivated recent research efforts towards developing fingerphoto Presentation Attack Detection (PAD) techniques. However, prior PAD approaches utilized supervised learning methods that require labeled training data for both bona fide and attack samples. This can suffer from two key issues, namely (i) generalization:the detection of novel presentation attack instruments (PAIs) unseen in the training data, and (ii) scalability:the collection of a large dataset of attack samples using different PAIs. To address these challenges, we propose a novel unsupervised approach based on a state-of-the-art deep-learning-based diffusion model, the Denoising Diffusion Probabilistic Model (DDPM), which is trained solely on bona fide samples. The proposed approach detects Presentation Attacks (PA) by calculating the reconstruction similarity between the input and output pairs of the DDPM. We present extensive experiments across three PAI datasets to test the accuracy and generalization capability of our approach. The results show that the proposed DDPM-based PAD method achieves significantly better detection error rates on several PAI classes compared to other baseline unsupervised approaches.
△ Less
Submitted 27 September, 2024;
originally announced September 2024.
-
Advancing Automated Knowledge Transfer in Evolutionary Multitasking via Large Language Models
Authors:
Yuxiao Huang,
Xuebin Lv,
Shenghao Wu,
Jibin Wu,
Liang Feng,
Kay Chen Tan
Abstract:
Evolutionary Multi-task Optimization (EMTO) is a paradigm that leverages knowledge transfer across simultaneously optimized tasks for enhanced search performance. To facilitate EMTO's performance, various knowledge transfer models have been developed for specific optimization tasks. However, designing these models often requires substantial expert knowledge. Recently, large language models (LLMs)…
▽ More
Evolutionary Multi-task Optimization (EMTO) is a paradigm that leverages knowledge transfer across simultaneously optimized tasks for enhanced search performance. To facilitate EMTO's performance, various knowledge transfer models have been developed for specific optimization tasks. However, designing these models often requires substantial expert knowledge. Recently, large language models (LLMs) have achieved remarkable success in autonomous programming, aiming to produce effective solvers for specific problems. In this work, a LLM-based optimization paradigm is introduced to establish an autonomous model factory for generating knowledge transfer models, ensuring effective and efficient knowledge transfer across various optimization tasks. To evaluate the performance of the proposed method, we conducted comprehensive empirical studies comparing the knowledge transfer model generated by the LLM with existing state-of-the-art knowledge transfer methods. The results demonstrate that the generated model is able to achieve superior or competitive performance against hand-crafted knowledge transfer models in terms of both efficiency and effectiveness.
△ Less
Submitted 6 September, 2024;
originally announced September 2024.
-
AgGym: An agricultural biotic stress simulation environment for ultra-precision management planning
Authors:
Mahsa Khosravi,
Matthew Carroll,
Kai Liang Tan,
Liza Van der Laan,
Joscif Raigne,
Daren S. Mueller,
Arti Singh,
Aditya Balu,
Baskar Ganapathysubramanian,
Asheesh Kumar Singh,
Soumik Sarkar
Abstract:
Agricultural production requires careful management of inputs such as fungicides, insecticides, and herbicides to ensure a successful crop that is high-yielding, profitable, and of superior seed quality. Current state-of-the-art field crop management relies on coarse-scale crop management strategies, where entire fields are sprayed with pest and disease-controlling chemicals, leading to increased…
▽ More
Agricultural production requires careful management of inputs such as fungicides, insecticides, and herbicides to ensure a successful crop that is high-yielding, profitable, and of superior seed quality. Current state-of-the-art field crop management relies on coarse-scale crop management strategies, where entire fields are sprayed with pest and disease-controlling chemicals, leading to increased cost and sub-optimal soil and crop management. To overcome these challenges and optimize crop production, we utilize machine learning tools within a virtual field environment to generate localized management plans for farmers to manage biotic threats while maximizing profits. Specifically, we present AgGym, a modular, crop and stress agnostic simulation framework to model the spread of biotic stresses in a field and estimate yield losses with and without chemical treatments. Our validation with real data shows that AgGym can be customized with limited data to simulate yield outcomes under various biotic stress conditions. We further demonstrate that deep reinforcement learning (RL) policies can be trained using AgGym for designing ultra-precise biotic stress mitigation strategies with potential to increase yield recovery with less chemicals and lower cost. Our proposed framework enables personalized decision support that can transform biotic stress management from being schedule based and reactive to opportunistic and prescriptive. We also release the AgGym software implementation as a community resource and invite experts to contribute to this open-sourced publicly available modular environment framework. The source code can be accessed at: https://github.com/SCSLabISU/AgGym.
△ Less
Submitted 1 September, 2024;
originally announced September 2024.
-
PMSN: A Parallel Multi-compartment Spiking Neuron for Multi-scale Temporal Processing
Authors:
Xinyi Chen,
Jibin Wu,
Chenxiang Ma,
Yinsong Yan,
Yujie Wu,
Kay Chen Tan
Abstract:
Spiking Neural Networks (SNNs) hold great potential to realize brain-inspired, energy-efficient computational systems. However, current SNNs still fall short in terms of multi-scale temporal processing compared to their biological counterparts. This limitation has resulted in poor performance in many pattern recognition tasks with information that varies across different timescales. To address thi…
▽ More
Spiking Neural Networks (SNNs) hold great potential to realize brain-inspired, energy-efficient computational systems. However, current SNNs still fall short in terms of multi-scale temporal processing compared to their biological counterparts. This limitation has resulted in poor performance in many pattern recognition tasks with information that varies across different timescales. To address this issue, we put forward a novel spiking neuron model called Parallel Multi-compartment Spiking Neuron (PMSN). The PMSN emulates biological neurons by incorporating multiple interacting substructures and allows for flexible adjustment of the substructure counts to effectively represent temporal information across diverse timescales. Additionally, to address the computational burden associated with the increased complexity of the proposed model, we introduce two parallelization techniques that decouple the temporal dependencies of neuronal updates, enabling parallelized training across different time steps. Our experimental results on a wide range of pattern recognition tasks demonstrate the superiority of PMSN. It outperforms other state-of-the-art spiking neuron models in terms of its temporal processing capacity, training speed, and computation cost. Specifically, compared with the commonly used Leaky Integrate-and-Fire neuron, PMSN offers a simulation acceleration of over 10 $\times$ and a 30 % improvement in accuracy on Sequential CIFAR10 dataset, while maintaining comparable computational cost.
△ Less
Submitted 27 August, 2024;
originally announced August 2024.
-
Design Principle Transfer in Neural Architecture Search via Large Language Models
Authors:
Xun Zhou,
Xingyu Wu,
Liang Feng,
Zhichao Lu,
Kay Chen Tan
Abstract:
Transferable neural architecture search (TNAS) has been introduced to design efficient neural architectures for multiple tasks, to enhance the practical applicability of NAS in real-world scenarios. In TNAS, architectural knowledge accumulated in previous search processes is reused to warm up the architecture search for new tasks. However, existing TNAS methods still search in an extensive search…
▽ More
Transferable neural architecture search (TNAS) has been introduced to design efficient neural architectures for multiple tasks, to enhance the practical applicability of NAS in real-world scenarios. In TNAS, architectural knowledge accumulated in previous search processes is reused to warm up the architecture search for new tasks. However, existing TNAS methods still search in an extensive search space, necessitating the evaluation of numerous architectures. To overcome this challenge, this work proposes a novel transfer paradigm, i.e., design principle transfer. In this work, the linguistic description of various structural components' effects on architectural performance is termed design principles. They are learned from established architectures and then can be reused to reduce the search space by discarding unpromising architectures. Searching in the refined search space can boost both the search performance and efficiency for new NAS tasks. To this end, a large language model (LLM)-assisted design principle transfer (LAPT) framework is devised. In LAPT, LLM is applied to automatically reason the design principles from a set of given architectures, and then a principle adaptation method is applied to refine these principles progressively based on the new search results. Experimental results show that LAPT can beat the state-of-the-art TNAS methods on most tasks and achieve comparable performance on others.
△ Less
Submitted 17 December, 2024; v1 submitted 21 August, 2024;
originally announced August 2024.
-
Crystalline Material Discovery in the Era of Artificial Intelligence
Authors:
Zhenzhong Wang,
Haowei Hua,
Wanyu Lin,
Ming Yang,
Kay Chen Tan
Abstract:
Crystalline materials, with symmetrical and periodic structures, exhibit a wide spectrum of properties and have been widely used in numerous applications across electronics, energy, and beyond. For crystalline materials discovery, traditional experimental and computational approaches are time-consuming and expensive. In these years, thanks to the explosive amount of crystalline materials data, gre…
▽ More
Crystalline materials, with symmetrical and periodic structures, exhibit a wide spectrum of properties and have been widely used in numerous applications across electronics, energy, and beyond. For crystalline materials discovery, traditional experimental and computational approaches are time-consuming and expensive. In these years, thanks to the explosive amount of crystalline materials data, great interest has been given to data-driven materials discovery. Particularly, recent advancements have exploited the expressive representation ability of deep learning to model the highly complex atomic systems within crystalline materials, opening up new avenues for fast and accurate materials discovery. These works typically focus on four types of tasks, including physicochemical property prediction, crystalline material synthesis, aiding characterization, and accelerating theoretical computations. Despite the remarkable progress, there is still a lack of systematic investigation to summarize their distinctions and limitations. To fill this gap, we systematically investigated the progress made in recent years. We first introduce several data representations of the crystalline materials. Based on the representations, we summarize various fundamental deep learning models and their tailored usages in various material discovery tasks. Finally, we highlight the remaining challenges and propose future directions.
△ Less
Submitted 1 February, 2025; v1 submitted 15 August, 2024;
originally announced August 2024.
-
Surrogate-Assisted Search with Competitive Knowledge Transfer for Expensive Optimization
Authors:
Xiaoming Xue,
Yao Hu,
Liang Feng,
Kai Zhang,
Linqi Song,
Kay Chen Tan
Abstract:
Expensive optimization problems (EOPs) have attracted increasing research attention over the decades due to their ubiquity in a variety of practical applications. Despite many sophisticated surrogate-assisted evolutionary algorithms (SAEAs) that have been developed for solving such problems, most of them lack the ability to transfer knowledge from previously-solved tasks and always start their sea…
▽ More
Expensive optimization problems (EOPs) have attracted increasing research attention over the decades due to their ubiquity in a variety of practical applications. Despite many sophisticated surrogate-assisted evolutionary algorithms (SAEAs) that have been developed for solving such problems, most of them lack the ability to transfer knowledge from previously-solved tasks and always start their search from scratch, making them troubled by the notorious cold-start issue. A few preliminary studies that integrate transfer learning into SAEAs still face some issues, such as defective similarity quantification that is prone to underestimate promising knowledge, surrogate-dependency that makes the transfer methods not coherent with the state-of-the-art in SAEAs, etc. In light of the above, a plug and play competitive knowledge transfer method is proposed to boost various SAEAs in this paper. Specifically, both the optimized solutions from the source tasks and the promising solutions acquired by the target surrogate are treated as task-solving knowledge, enabling them to compete with each other to elect the winner for expensive evaluation, thus boosting the search speed on the target task. Moreover, the lower bound of the convergence gain brought by the knowledge competition is mathematically analyzed, which is expected to strengthen the theoretical foundation of sequential transfer optimization. Experimental studies conducted on a series of benchmark problems and a practical application from the petroleum industry verify the efficacy of the proposed method. The source code of the competitive knowledge transfer is available at https://github.com/XmingHsueh/SAS-CKT.
△ Less
Submitted 20 August, 2024; v1 submitted 13 August, 2024;
originally announced August 2024.
-
FoVNet: Configurable Field-of-View Speech Enhancement with Low Computation and Distortion for Smart Glasses
Authors:
Zhongweiyang Xu,
Ali Aroudi,
Ke Tan,
Ashutosh Pandey,
Jung-Suk Lee,
Buye Xu,
Francesco Nesta
Abstract:
This paper presents a novel multi-channel speech enhancement approach, FoVNet, that enables highly efficient speech enhancement within a configurable field of view (FoV) of a smart-glasses user without needing specific target-talker(s) directions. It advances over prior works by enhancing all speakers within any given FoV, with a hybrid signal processing and deep learning approach designed with hi…
▽ More
This paper presents a novel multi-channel speech enhancement approach, FoVNet, that enables highly efficient speech enhancement within a configurable field of view (FoV) of a smart-glasses user without needing specific target-talker(s) directions. It advances over prior works by enhancing all speakers within any given FoV, with a hybrid signal processing and deep learning approach designed with high computational efficiency. The neural network component is designed with ultra-low computation (about 50 MMACS). A multi-channel Wiener filter and a post-processing module are further used to improve perceptual quality. We evaluate our algorithm with a microphone array on smart glasses, providing a configurable, efficient solution for augmented hearing on energy-constrained devices. FoVNet excels in both computational efficiency and speech quality across multiple scenarios, making it a promising solution for smart glasses applications.
△ Less
Submitted 12 August, 2024;
originally announced August 2024.
-
Hybrid Reinforcement Learning Breaks Sample Size Barriers in Linear MDPs
Authors:
Kevin Tan,
Wei Fan,
Yuting Wei
Abstract:
Hybrid Reinforcement Learning (RL), where an agent learns from both an offline dataset and online explorations in an unknown environment, has garnered significant recent interest. A crucial question posed by Xie et al. (2022) is whether hybrid RL can improve upon the existing lower bounds established in purely offline and purely online RL without relying on the single-policy concentrability assump…
▽ More
Hybrid Reinforcement Learning (RL), where an agent learns from both an offline dataset and online explorations in an unknown environment, has garnered significant recent interest. A crucial question posed by Xie et al. (2022) is whether hybrid RL can improve upon the existing lower bounds established in purely offline and purely online RL without relying on the single-policy concentrability assumption. While Li et al. (2023) provided an affirmative answer to this question in the tabular PAC RL case, the question remains unsettled for both the regret-minimizing RL case and the non-tabular case.
In this work, building upon recent advancements in offline RL and reward-agnostic exploration, we develop computationally efficient algorithms for both PAC and regret-minimizing RL with linear function approximation, without single-policy concentrability. We demonstrate that these algorithms achieve sharper error or regret bounds that are no worse than, and can improve on, the optimal sample complexity in offline RL (the first algorithm, for PAC RL) and online RL (the second algorithm, for regret-minimizing RL) in linear Markov decision processes (MDPs), regardless of the quality of the behavior policy. To our knowledge, this work establishes the tightest theoretical guarantees currently available for hybrid RL in linear MDPs.
△ Less
Submitted 8 August, 2024;
originally announced August 2024.
-
CommonUppRoad: A Framework of Formal Modelling, Verifying, Learning, and Visualisation of Autonomous Vehicles
Authors:
Rong Gu,
Kaige Tan,
Andreas Holck Høeg-Petersen,
Lei Feng,
Kim Guldstrand Larsen
Abstract:
Combining machine learning and formal methods (FMs) provides a possible solution to overcome the safety issue of autonomous driving (AD) vehicles. However, there are gaps to be bridged before this combination becomes practically applicable and useful. In an attempt to facilitate researchers in both FMs and AD areas, this paper proposes a framework that combines two well-known tools, namely CommonR…
▽ More
Combining machine learning and formal methods (FMs) provides a possible solution to overcome the safety issue of autonomous driving (AD) vehicles. However, there are gaps to be bridged before this combination becomes practically applicable and useful. In an attempt to facilitate researchers in both FMs and AD areas, this paper proposes a framework that combines two well-known tools, namely CommonRoad and UPPAAL. On the one hand, CommonRoad can be enhanced by the rigorous semantics of models in UPPAAL, which enables a systematic and comprehensive understanding of the AD system's behaviour and thus strengthens the safety of the system. On the other hand, controllers synthesised by UPPAAL can be visualised by CommonRoad in real-world road networks, which facilitates AD vehicle designers greatly adopting formal models in system design. In this framework, we provide automatic model conversions between CommonRoad and UPPAAL. Therefore, users only need to program in Python and the framework takes care of the formal models, learning, and verification in the backend. We perform experiments to demonstrate the applicability of our framework in various AD scenarios, discuss the advantages of solving motion planning in our framework, and show the scalability limit and possible solutions.
△ Less
Submitted 2 August, 2024;
originally announced August 2024.
-
Progressive Domain Adaptation for Thermal Infrared Object Tracking
Authors:
Qiao Li,
Kanlun Tan,
Qiao Liu,
Di Yuan,
Xin Li,
Yunpeng Liu
Abstract:
Due to the lack of large-scale labeled Thermal InfraRed (TIR) training datasets, most existing TIR trackers are trained directly on RGB datasets. However, tracking methods trained on RGB datasets suffer a significant drop-off in TIR data due to the domain shift issue. To this end, in this work, we propose a Progressive Domain Adaptation framework for TIR Tracking (PDAT), which transfers useful kno…
▽ More
Due to the lack of large-scale labeled Thermal InfraRed (TIR) training datasets, most existing TIR trackers are trained directly on RGB datasets. However, tracking methods trained on RGB datasets suffer a significant drop-off in TIR data due to the domain shift issue. To this end, in this work, we propose a Progressive Domain Adaptation framework for TIR Tracking (PDAT), which transfers useful knowledge learned from RGB tracking to TIR tracking. The framework makes full use of large-scale labeled RGB datasets without requiring time-consuming and labor-intensive labeling of large-scale TIR data. Specifically, we first propose an adversarial-based global domain adaptation module to reduce domain gap on the feature level coarsely. Second, we design a clustering-based subdomain adaptation method to further align the feature distributions of the RGB and TIR datasets finely. These two domain adaptation modules gradually eliminate the discrepancy between the two domains, and thus learn domain-invariant fine-grained features through progressive training. Additionally, we collect a largescale TIR dataset with over 1.48 million unlabeled TIR images for training the proposed domain adaptation framework. Experimental results on five TIR tracking benchmarks show that the proposed method gains a nearly 6% success rate, demonstrating its effectiveness.
△ Less
Submitted 3 September, 2024; v1 submitted 28 July, 2024;
originally announced July 2024.
-
Holistically-Nested Structure-Aware Graph Neural Network for Road Extraction
Authors:
Tinghuai Wang,
Guangming Wang,
Kuan Eeik Tan
Abstract:
Convolutional neural networks (CNN) have made significant advances in detecting roads from satellite images. However, existing CNN approaches are generally repurposed semantic segmentation architectures and suffer from the poor delineation of long and curved regions. Lack of overall road topology and structure information further deteriorates their performance on challenging remote sensing images.…
▽ More
Convolutional neural networks (CNN) have made significant advances in detecting roads from satellite images. However, existing CNN approaches are generally repurposed semantic segmentation architectures and suffer from the poor delineation of long and curved regions. Lack of overall road topology and structure information further deteriorates their performance on challenging remote sensing images. This paper presents a novel multi-task graph neural network (GNN) which simultaneously detects both road regions and road borders; the inter-play between these two tasks unlocks superior performance from two perspectives: (1) the hierarchically detected road borders enable the network to capture and encode holistic road structure to enhance road connectivity (2) identifying the intrinsic correlation of semantic landcover regions mitigates the difficulty in recognizing roads cluttered by regions with similar appearance. Experiments on challenging dataset demonstrate that the proposed architecture can improve the road border delineation and road extraction accuracy compared with the existing methods.
△ Less
Submitted 8 July, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
Learning to Transfer for Evolutionary Multitasking
Authors:
Sheng-Hao Wu,
Yuxiao Huang,
Xingyu Wu,
Liang Feng,
Zhi-Hui Zhan,
Kay Chen Tan
Abstract:
Evolutionary multitasking (EMT) is an emerging approach for solving multitask optimization problems (MTOPs) and has garnered considerable research interest. The implicit EMT is a significant research branch that utilizes evolution operators to enable knowledge transfer (KT) between tasks. However, current approaches in implicit EMT face challenges in adaptability, due to the use of a limited numbe…
▽ More
Evolutionary multitasking (EMT) is an emerging approach for solving multitask optimization problems (MTOPs) and has garnered considerable research interest. The implicit EMT is a significant research branch that utilizes evolution operators to enable knowledge transfer (KT) between tasks. However, current approaches in implicit EMT face challenges in adaptability, due to the use of a limited number of evolution operators and insufficient utilization of evolutionary states for performing KT. This results in suboptimal exploitation of implicit KT's potential to tackle a variety of MTOPs. To overcome these limitations, we propose a novel Learning to Transfer (L2T) framework to automatically discover efficient KT policies for the MTOPs at hand. Our framework conceptualizes the KT process as a learning agent's sequence of strategic decisions within the EMT process. We propose an action formulation for deciding when and how to transfer, a state representation with informative features of evolution states, a reward formulation concerning convergence and transfer efficiency gain, and the environment for the agent to interact with MTOPs. We employ an actor-critic network structure for the agent and learn it via proximal policy optimization. This learned agent can be integrated with various evolutionary algorithms, enhancing their ability to address a range of new MTOPs. Comprehensive empirical studies on both synthetic and real-world MTOPs, encompassing diverse inter-task relationships, function classes, and task distributions are conducted to validate the proposed L2T framework. The results show a marked improvement in the adaptability and performance of implicit EMT when solving a wide spectrum of unseen MTOPs.
△ Less
Submitted 22 June, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
A framework for developing a knowledge management platform
Authors:
Marie Lisandra Zepeda Mendoza,
Sonali Agarwal,
James A. Blackshaw,
Vanesa Bol,
Audrey Fazzi,
Filippo Fiorini,
Amy Louise Foreman,
Nancy George,
Brett R. Johnson,
Brian Martin,
Dave McComb,
Euphemia Mutasa-Gottgens,
Helen Parkinson,
Martin Romacker,
Rolf Russell,
Valérien Ségard,
Shawn Zheng Kai Tan,
Wei Kheng Teh,
F. P. Winstanley,
Benedict Wong,
Adrian M. Smith
Abstract:
Knowledge management (KM) involves collecting, organizing, storing, and disseminating information to improve decision-making, innovation, and performance. Implementing KM at scale has become essential for organizations to effectively leverage vast accessible data. This paper is a compilation of concepts that emerged from KM workshops hosted by EMBL-EBI, attended by SMEs and industry. We provide gu…
▽ More
Knowledge management (KM) involves collecting, organizing, storing, and disseminating information to improve decision-making, innovation, and performance. Implementing KM at scale has become essential for organizations to effectively leverage vast accessible data. This paper is a compilation of concepts that emerged from KM workshops hosted by EMBL-EBI, attended by SMEs and industry. We provide guidance on envisioning, executing, evaluating, and evolving knowledge management platforms. We emphasize essential considerations such as setting knowledge domain boundaries and measuring success, as well as the importance of making knowledge accessible for downstream applications and non-computational users and highlights necessary personal and organizational skills for success. We stress the importance of collaboration and the need for convergence on shared principles and commitment to provide or seek resources to advance KM. The community is invited to join the journey of KM and contribute to the advancement of the field by applying and improving on the guidelines described.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Physics-Constrained Learning for PDE Systems with Uncertainty Quantified Port-Hamiltonian Models
Authors:
Kaiyuan Tan,
Peilun Li,
Thomas Beckers
Abstract:
Modeling the dynamics of flexible objects has become an emerging topic in the community as these objects become more present in many applications, e.g., soft robotics. Due to the properties of flexible materials, the movements of soft objects are often highly nonlinear and, thus, complex to predict. Data-driven approaches seem promising for modeling those complex dynamics but often neglect basic p…
▽ More
Modeling the dynamics of flexible objects has become an emerging topic in the community as these objects become more present in many applications, e.g., soft robotics. Due to the properties of flexible materials, the movements of soft objects are often highly nonlinear and, thus, complex to predict. Data-driven approaches seem promising for modeling those complex dynamics but often neglect basic physical principles, which consequently makes them untrustworthy and limits generalization. To address this problem, we propose a physics-constrained learning method that combines powerful learning tools and reliable physical models. Our method leverages the data collected from observations by sending them into a Gaussian process that is physically constrained by a distributed Port-Hamiltonian model. Based on the Bayesian nature of the Gaussian process, we not only learn the dynamics of the system, but also enable uncertainty quantification. Furthermore, the proposed approach preserves the compositional nature of Port-Hamiltonian systems.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling
Authors:
Vahid Ahmadi Kalkhorani,
Cheng Yu,
Anurag Kumar,
Ke Tan,
Buye Xu,
DeLiang Wang
Abstract:
Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral mapping for speech separation by lever…
▽ More
Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral mapping for speech separation by leveraging global attention and positional encoding. To effectively utilize visual cues, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before feeding to AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, and COG-MHEAR challenge. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.