Search | arXiv e-print repository

Beyond Linearity: Squeeze-and-Recalibrate Blocks for Few-Shot Whole Slide Image Classification

Authors: Conghao Xiong, Zhengrui Guo, Zhe Xu, Yifei Zhang, Raymond Kai-Yu Tong, Si Yong Yeo, Hao Chen, Joseph J. Y. Sung, Irwin King

Abstract: Deep learning has advanced computational pathology but expert annotations remain scarce. Few-shot learning mitigates annotation burdens yet suffers from overfitting and discriminative feature mischaracterization. In addition, the current few-shot multiple instance learning (MIL) approaches leverage pretrained vision-language models to alleviate these issues, but at the cost of complex preprocessin… ▽ More Deep learning has advanced computational pathology but expert annotations remain scarce. Few-shot learning mitigates annotation burdens yet suffers from overfitting and discriminative feature mischaracterization. In addition, the current few-shot multiple instance learning (MIL) approaches leverage pretrained vision-language models to alleviate these issues, but at the cost of complex preprocessing and high computational cost. We propose a Squeeze-and-Recalibrate (SR) block, a drop-in replacement for linear layers in MIL models to address these challenges. The SR block comprises two core components: a pair of low-rank trainable matrices (squeeze pathway, SP) that reduces parameter count and imposes a bottleneck to prevent spurious feature learning, and a frozen random recalibration matrix that preserves geometric structure, diversifies feature directions, and redefines the optimization objective for the SP. We provide theoretical guarantees that the SR block can approximate any linear mapping to arbitrary precision, thereby ensuring that the performance of a standard MIL model serves as a lower bound for its SR-enhanced counterpart. Extensive experiments demonstrate that our SR-MIL models consistently outperform prior methods while requiring significantly fewer parameters and no architectural changes. △ Less

Submitted 21 May, 2025; originally announced May 2025.

arXiv:2503.17069 [pdf, other]

PVChat: Personalized Video Chat with One-Shot Learning

Authors: Yufei Shi, Weilong Yan, Gang Xu, Yumeng Li, Yuchen Li, Zhenxi Li, Fei Richard Yu, Ming Li, Si Yong Yeo

Abstract: Video large language models (ViLLMs) excel in general video understanding, e.g., recognizing activities like talking and eating, but struggle with identity-aware comprehension, such as "Wilson is receiving chemotherapy" or "Tom is discussing with Sarah", limiting their applicability in smart healthcare and smart home environments. To address this limitation, we propose a one-shot learning framewor… ▽ More Video large language models (ViLLMs) excel in general video understanding, e.g., recognizing activities like talking and eating, but struggle with identity-aware comprehension, such as "Wilson is receiving chemotherapy" or "Tom is discussing with Sarah", limiting their applicability in smart healthcare and smart home environments. To address this limitation, we propose a one-shot learning framework PVChat, the first personalized ViLLM that enables subject-aware question answering (QA) from a single video for each subject. Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset, leveraging a progressive image-to-video learning strategy. Specifically, we introduce an automated augmentation pipeline that synthesizes identity-preserving positive samples and retrieves hard negatives from existing video corpora, generating a diverse training dataset with four QA types: existence, appearance, action, and location inquiries. To enhance subject-specific learning, we propose a ReLU Routing MoH attention mechanism, alongside two novel objectives: (1) Smooth Proximity Regularization for progressive learning through exponential distance scaling and (2) Head Activation Enhancement for balanced attention routing. Finally, we adopt a two-stage training strategy, transitioning from image pre-training to video fine-tuning, enabling a gradual learning process from static attributes to dynamic representations. We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, and real-world footage, demonstrating its superiority in personalized feature understanding after learning from a single video, compared to state-of-the-art ViLLMs. △ Less

Submitted 21 March, 2025; originally announced March 2025.

arXiv:2503.06565 [pdf, other]

Future-Aware Interaction Network For Motion Forecasting

Authors: Shijie Li, Xun Xu, Si Yong Yeo, Xulei Yang

Abstract: Motion forecasting is a crucial component of autonomous driving systems, enabling the generation of accurate and smooth future trajectories to ensure safe navigation to the destination. In previous methods, potential future trajectories are often absent in the scene encoding stage, which may lead to suboptimal outcomes. Additionally, prior approaches typically employ transformer architectures for… ▽ More Motion forecasting is a crucial component of autonomous driving systems, enabling the generation of accurate and smooth future trajectories to ensure safe navigation to the destination. In previous methods, potential future trajectories are often absent in the scene encoding stage, which may lead to suboptimal outcomes. Additionally, prior approaches typically employ transformer architectures for spatiotemporal modeling of trajectories and map information, which suffer from the quadratic scaling complexity of the transformer architecture. In this work, we propose an interaction-based method, named Future-Aware Interaction Network, that introduces potential future trajectories into scene encoding for a comprehensive traffic representation. Furthermore, a State Space Model (SSM), specifically Mamba, is introduced for both spatial and temporal modeling. To adapt Mamba for spatial interaction modeling, we propose an adaptive reordering strategy that transforms unordered data into a structured sequence. Additionally, Mamba is employed to refine generated future trajectories temporally, ensuring more consistent predictions. These enhancements not only improve model efficiency but also enhance the accuracy and diversity of predictions. We conduct comprehensive experiments on the widely used Argoverse 1 and Argoverse 2 datasets, demonstrating that the proposed method achieves superior performance compared to previous approaches in a more efficient way. The code will be released according to the acceptance. △ Less

Submitted 9 March, 2025; originally announced March 2025.

arXiv:2503.01019 [pdf, other]

MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations

Authors: Ziyang Zhang, Yang Yu, Yucheng Chen, Xulei Yang, Si Yong Yeo

Abstract: Despite significant progress in Vision-Language Pre-training (VLP), current approaches predominantly emphasize feature extraction and cross-modal comprehension, with limited attention to generating or transforming visual content. This gap hinders the model's ability to synthesize coherent and novel visual representations from textual prompts, thereby reducing the effectiveness of multi-modal learn… ▽ More Despite significant progress in Vision-Language Pre-training (VLP), current approaches predominantly emphasize feature extraction and cross-modal comprehension, with limited attention to generating or transforming visual content. This gap hinders the model's ability to synthesize coherent and novel visual representations from textual prompts, thereby reducing the effectiveness of multi-modal learning. In this work, we propose MedUnifier, a unified VLP framework tailored for medical data. MedUnifier seamlessly integrates text-grounded image generation capabilities with multi-modal learning strategies, including image-text contrastive alignment, image-text matching and image-grounded text generation. Unlike traditional methods that reply on continuous visual representations, our approach employs visual vector quantization, which not only facilitates a more cohesive learning strategy for cross-modal understanding but also enhances multi-modal generation quality by effectively leveraging discrete representations. Our framework's effectiveness is evidenced by the experiments on established benchmarks, including uni-modal tasks (supervised fine-tuning), cross-modal tasks (image-text retrieval and zero-shot image classification), and multi-modal tasks (medical report generation, image synthesis), where it achieves state-of-the-art performance across various tasks. MedUnifier also offers a highly adaptable tool for a wide range of language and vision tasks in healthcare, marking advancement toward the development of a generalizable AI model for medical applications. △ Less

Submitted 20 April, 2025; v1 submitted 2 March, 2025; originally announced March 2025.

Comments: To be pubilshed in CVPR 2025

arXiv:2501.00775 [pdf, other]

MindCoder: Automated and Controllable Reasoning Chain in Qualitative Analysis

Authors: Jie Gao, Zhiyao Shu, Shun Yi Yeo

Abstract: Extracting insights from qualitative analysis involves a series of reasoning steps, such as open coding, grouping, and identifying themes. We introduce the MindCoder reasoning chain, built on Chain-of-Thought (CoT) prompting, to support the insight extraction process step by step-including topic clustering, code labeling, conceptualization, and reporting. We designed the MindCoder web application… ▽ More Extracting insights from qualitative analysis involves a series of reasoning steps, such as open coding, grouping, and identifying themes. We introduce the MindCoder reasoning chain, built on Chain-of-Thought (CoT) prompting, to support the insight extraction process step by step-including topic clustering, code labeling, conceptualization, and reporting. We designed the MindCoder web application to help users 1) automatically run this reasoning chain (i.e., obtain analysis report results in approximately 3-5 minutes) and 2) interactively control the reasoning process on demand. Our technical evaluations assess its reliability across various data types and demonstrate that simulated human iteration can potentially enhance coding quality. A user study further confirmed positive feedback regarding MindCoder's automation and its on-demand reasoning functionality. △ Less

Submitted 16 April, 2025; v1 submitted 1 January, 2025; originally announced January 2025.

Comments: 17 pages for main content, 3 pages for references, 10 pages for appendix

arXiv:2408.10222 [pdf, other]

Near-Orthogonal Overlay Communications in LoS Channel Enabled by Novel OAM Beams without Central Energy Voids: An Experimental Study

Authors: Yufei Zhao, Xiaoyan Ma, Yong Liang Guan, Yile Liu, Afkar Mohamed Ismail, Xiaobei Liu, Siew Yam Yeo, Chau Yuen

Abstract: This paper introduces a groundbreaking Line-of-Sight (LoS) Multiple-Input Multiple-Output (MIMO) communication architecture leveraging non-traditional Orbital Angular Momentum (OAM) beams. Challenging the conventional paradigm of hollow-emitting OAM beams, this study presents an innovative OAM transmitter design that produces directional OAM beams without central energy voids, aligning their radia… ▽ More This paper introduces a groundbreaking Line-of-Sight (LoS) Multiple-Input Multiple-Output (MIMO) communication architecture leveraging non-traditional Orbital Angular Momentum (OAM) beams. Challenging the conventional paradigm of hollow-emitting OAM beams, this study presents an innovative OAM transmitter design that produces directional OAM beams without central energy voids, aligning their radiation patterns with those of conventional planar wave horn antennas. Within the main lobe of these antennas, the phase variation characteristics inherent to OAM beams are ingeniously maintained, linking different OAM modes to the linear wavefront variation gradients, thereby reducing channel correlation in LoS scenarios and significantly augmenting the channel capacity of LoS-MIMO frameworks. Empirical validations conducted through a meticulously designed LoS-MIMO experimental platform reveal significant improvements in channel correlation coefficients, communication stability, and Bit Error Rate (BER) compared to systems utilizing traditional planar wave antennas. The experiment results underscore the potential of the novel OAM-based system to improve current LoS-MIMO communication protocols, and offer both academic and engineering guidance for the construction of practical communication infrastructures. Beyond its immediate contributions, this paper underscores a pivotal shift in the field of communications, pointing out that traditional communication algorithms have primarily focused on baseband signal processing while often overlooking the electromagnetic characteristics of the physical world. △ Less

Submitted 27 July, 2024; originally announced August 2024.

arXiv:2407.14230 [pdf, other]

ETSCL: An Evidence Theory-Based Supervised Contrastive Learning Framework for Multi-modal Glaucoma Grading

Authors: Zhiyuan Yang, Bo Zhang, Yufei Shi, Ningze Zhong, Johnathan Loh, Huihui Fang, Yanwu Xu, Si Yong Yeo

Abstract: Glaucoma is one of the leading causes of vision impairment. Digital imaging techniques, such as color fundus photography (CFP) and optical coherence tomography (OCT), provide quantitative and noninvasive methods for glaucoma diagnosis. Recently, in the field of computer-aided glaucoma diagnosis, multi-modality methods that integrate the CFP and OCT modalities have achieved greater diagnostic accur… ▽ More Glaucoma is one of the leading causes of vision impairment. Digital imaging techniques, such as color fundus photography (CFP) and optical coherence tomography (OCT), provide quantitative and noninvasive methods for glaucoma diagnosis. Recently, in the field of computer-aided glaucoma diagnosis, multi-modality methods that integrate the CFP and OCT modalities have achieved greater diagnostic accuracy compared to single-modality methods. However, it remains challenging to extract reliable features due to the high similarity of medical images and the unbalanced multi-modal data distribution. Moreover, existing methods overlook the uncertainty estimation of different modalities, leading to unreliable predictions. To address these challenges, we propose a novel framework, namely ETSCL, which consists of a contrastive feature extraction stage and a decision-level fusion stage. Specifically, the supervised contrastive loss is employed to enhance the discriminative power in the feature extraction process, resulting in more effective features. In addition, we utilize the Frangi vesselness algorithm as a preprocessing step to incorporate vessel information to assist in the prediction. In the decision-level fusion stage, an evidence theory-based multi-modality classifier is employed to combine multi-source information with uncertainty estimation. Extensive experiments demonstrate that our method achieves state-of-the-art performance. The code is available at \url{https://github.com/master-Shix/ETSCL}. △ Less

Submitted 19 July, 2024; originally announced July 2024.

Comments: Accepted by Ophthalmic Medical Image Analysis Workshop at MICCAI'24

arXiv:2407.05307 [pdf]

Edge-guided and Cross-scale Feature Fusion Network for Efficient Multi-contrast MRI Super-Resolution

Authors: Zhiyuan Yang, Bo Zhang, Zhiqiang Zeng, Si Yong Yeo

Abstract: In recent years, MRI super-resolution techniques have achieved great success, especially multi-contrast methods that extract texture information from reference images to guide the super-resolution reconstruction. However, current methods primarily focus on texture similarities at the same scale, neglecting cross-scale similarities that provide comprehensive information. Moreover, the misalignment… ▽ More In recent years, MRI super-resolution techniques have achieved great success, especially multi-contrast methods that extract texture information from reference images to guide the super-resolution reconstruction. However, current methods primarily focus on texture similarities at the same scale, neglecting cross-scale similarities that provide comprehensive information. Moreover, the misalignment between features of different scales impedes effective aggregation of information flow. To address the limitations, we propose a novel edge-guided and cross-scale feature fusion network, namely ECFNet. Specifically, we develop a pipeline consisting of the deformable convolution and the cross-attention transformer to align features of different scales. The cross-scale fusion strategy fully integrates the texture information from different scales, significantly enhancing the super-resolution. In addition, a novel structure information collaboration module is developed to guide the super-resolution reconstruction with implicit structure priors. The structure information enables the network to focus on high-frequency components of the image, resulting in sharper details. Extensive experiments on the IXI and BraTS2020 datasets demonstrate that our method achieves state-of-the-art performance compared to other multi-contrast MRI super-resolution methods, and our method is robust in terms of different super-resolution scales. We would like to release our code and pre-trained model after the paper is accepted. △ Less

Submitted 24 August, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

Comments: submitted to ICPR2024

arXiv:2401.10820 [pdf, other]

doi 10.1145/3613904.3642530

Help Me Reflect: Leveraging Self-Reflection Interface Nudges to Enhance Deliberativeness on Online Deliberation Platforms

Authors: Shun Yi Yeo, Gionnieve Lim, Jie Gao, Weiyu Zhang, Simon Tangi Perrault

Abstract: The deliberative potential of online platforms has been widely examined. However, little is known about how various interface-based reflection nudges impact the quality of deliberation. This paper presents two user studies with 12 and 120 participants, respectively, to investigate the impacts of different reflective nudges on the quality of deliberation. In the first study, we examined five distin… ▽ More The deliberative potential of online platforms has been widely examined. However, little is known about how various interface-based reflection nudges impact the quality of deliberation. This paper presents two user studies with 12 and 120 participants, respectively, to investigate the impacts of different reflective nudges on the quality of deliberation. In the first study, we examined five distinct reflective nudges: persona, temporal prompts, analogies and metaphors, cultural prompts and storytelling. Persona, temporal prompts, and storytelling emerged as the preferred nudges for implementation on online deliberation platforms. In the second study, we assess the impacts of these preferred reflectors more thoroughly. Results revealed a significant positive impact of these reflectors on deliberative quality. Specifically, persona promotes a deliberative environment for balanced and opinionated viewpoints while temporal prompts promote more individualised viewpoints. Our findings suggest that the choice of reflectors can significantly influence the dynamics and shape the nature of online discussions. △ Less

Submitted 19 January, 2024; originally announced January 2024.

arXiv:2309.17143 [pdf, other]

Revisiting Cephalometric Landmark Detection from the view of Human Pose Estimation with Lightweight Super-Resolution Head

Authors: Qian Wu, Si Yong Yeo, Yufei Chen, Jun Liu

Abstract: Accurate localization of cephalometric landmarks holds great importance in the fields of orthodontics and orthognathics due to its potential for automating key point labeling. In the context of landmark detection, particularly in cephalometrics, it has been observed that existing methods often lack standardized pipelines and well-designed bias reduction processes, which significantly impact their… ▽ More Accurate localization of cephalometric landmarks holds great importance in the fields of orthodontics and orthognathics due to its potential for automating key point labeling. In the context of landmark detection, particularly in cephalometrics, it has been observed that existing methods often lack standardized pipelines and well-designed bias reduction processes, which significantly impact their performance. In this paper, we revisit a related task, human pose estimation (HPE), which shares numerous similarities with cephalometric landmark detection (CLD), and emphasize the potential for transferring techniques from the former field to benefit the latter. Motivated by this insight, we have developed a robust and adaptable benchmark based on the well-established HPE codebase known as MMPose. This benchmark can serve as a dependable baseline for achieving exceptional CLD performance. Furthermore, we introduce an upscaling design within the framework to further enhance performance. This enhancement involves the incorporation of a lightweight and efficient super-resolution module, which generates heatmap predictions on high-resolution features and leads to further performance refinement, benefiting from its ability to reduce quantization bias. In the MICCAI CLDetection2023 challenge, our method achieves 1st place ranking on three metrics and 3rd place on the remaining one. The code for our method is available at https://github.com/5k5000/CLdetection2023. △ Less

Submitted 29 September, 2023; originally announced September 2023.

arXiv:2204.08129 [pdf, other]

Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding

Authors: Xun Long Ng, Kian Eng Ong, Qichen Zheng, Yun Ni, Si Yong Yeo, Jun Liu

Abstract: Understanding animals' behaviors is significant for a wide range of applications. However, existing animal behavior datasets have limitations in multiple aspects, including limited numbers of animal classes, data samples and provided tasks, and also limited variations in environmental conditions and viewpoints. To address these limitations, we create a large and diverse dataset, Animal Kingdom, th… ▽ More Understanding animals' behaviors is significant for a wide range of applications. However, existing animal behavior datasets have limitations in multiple aspects, including limited numbers of animal classes, data samples and provided tasks, and also limited variations in environmental conditions and viewpoints. To address these limitations, we create a large and diverse dataset, Animal Kingdom, that provides multiple annotated tasks to enable a more thorough understanding of natural animal behaviors. The wild animal footages used in our dataset record different times of the day in extensive range of environments containing variations in backgrounds, viewpoints, illumination and weather conditions. More specifically, our dataset contains 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, which correspond to a diverse range of animals with 850 species across 6 major animal classes. Such a challenging and comprehensive dataset shall be able to facilitate the community to develop, adapt, and evaluate various types of advanced methods for animal behavior analysis. Moreover, we propose a Collaborative Action Recognition (CARe) model that learns general and specific features for action recognition with unseen new animals. This method achieves promising performance in our experiments. Our dataset can be found at https://sutdcv.github.io/Animal-Kingdom. △ Less

Submitted 3 June, 2022; v1 submitted 17 April, 2022; originally announced April 2022.

Comments: Accepted by CVPR2022 (Oral). Dataset: https://sutdcv.github.io/Animal-Kingdom

arXiv:1703.01025 [pdf]

A Novel Multi-task Deep Learning Model for Skin Lesion Segmentation and Classification

Authors: Xulei Yang, Zeng Zeng, Si Yong Yeo, Colin Tan, Hong Liang Tey, Yi Su

Abstract: In this study, a multi-task deep neural network is proposed for skin lesion analysis. The proposed multi-task learning model solves different tasks (e.g., lesion segmentation and two independent binary lesion classifications) at the same time by exploiting commonalities and differences across tasks. This results in improved learning efficiency and potential prediction accuracy for the task-specifi… ▽ More In this study, a multi-task deep neural network is proposed for skin lesion analysis. The proposed multi-task learning model solves different tasks (e.g., lesion segmentation and two independent binary lesion classifications) at the same time by exploiting commonalities and differences across tasks. This results in improved learning efficiency and potential prediction accuracy for the task-specific models, when compared to training the individual models separately. The proposed multi-task deep learning model is trained and evaluated on the dermoscopic image sets from the International Skin Imaging Collaboration (ISIC) 2017 Challenge - Skin Lesion Analysis towards Melanoma Detection, which consists of 2000 training samples and 150 evaluation samples. The experimental results show that the proposed multi-task deep learning model achieves promising performances on skin lesion segmentation and classification. The average value of Jaccard index for lesion segmentation is 0.724, while the average values of area under the receiver operating characteristic curve (AUC) on two individual lesion classifications are 0.880 and 0.972, respectively. △ Less

Submitted 2 March, 2017; originally announced March 2017.

Comments: Submission to support ISIC 2017 challenge results

Showing 1–12 of 12 results for author: Yeo, S Y