-
PromptSR: Cascade Prompting for Lightweight Image Super-Resolution
Authors:
Wenyang Liu,
Chen Cai,
Jianjun Gao,
Kejun Wu,
Yi Wang,
Kim-Hui Yap,
Lap-Pui Chau
Abstract:
Although the lightweight Vision Transformer has significantly advanced image super-resolution (SR), it faces the inherent challenge of a limited receptive field due to the window-based self-attention modeling. The quadratic computational complexity relative to window size restricts its ability to use a large window size for expanding the receptive field while maintaining low computational costs. To address this challenge, we propose PromptSR, a novel prompt-empowered lightweight image SR method. The core component is the proposed cascade prompting block (CPB), which enhances global information access and local refinement via three cascaded prompting layers: a global anchor prompting layer (GAPL) and two local prompting layers (LPLs). The GAPL leverages downscaled features as anchors to construct low-dimensional anchor prompts (APs) through cross-scale attention, significantly reducing computational costs. These APs, with enhanced global perception, are then used to provide global prompts, efficiently facilitating long-range token connections. The two LPLs subsequently combine category-based self-attention and window-based self-attention to refine the representation in a coarse-to-fine manner. They leverage attention maps from the GAPL as additional global prompts, enabling them to perceive features globally at different granularities for adaptive local refinement. In this way, the proposed CPB effectively combines global priors and local details, significantly enlarging the receptive field while maintaining the low computational costs of our PromptSR. The experimental results demonstrate the superiority of our method, which outperforms state-of-the-art lightweight SR methods in quantitative, qualitative, and complexity evaluations. Our code will be released at https://github.com/wenyang001/PromptSR.
Submitted 5 July, 2025;
originally announced July 2025.
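
The quadratic cost mentioned in the PromptSR abstract can be illustrated with a small count of attention pairs. This is a minimal sketch, not code from the paper: the function name `window_attention_pairs` and the 64x64 feature-map size are assumptions chosen for illustration, and only query-key pairs are counted (channel width and projections are ignored).

```python
def window_attention_pairs(height, width, window):
    """Count query-key pairs scored by non-overlapping window self-attention.

    Hypothetical helper for illustration; not part of the PromptSR code base.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    tokens_per_window = window * window
    num_windows = (height // window) * (width // window)
    # Inside each window, every token attends to every token of that window.
    return num_windows * tokens_per_window ** 2


small = window_attention_pairs(64, 64, 8)
large = window_attention_pairs(64, 64, 16)
print(small, large, large // small)
```

Under these assumptions the total is `height * width * window**2`, so doubling the window side quadruples the number of scored pairs on the same feature map.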
-
SSH-Net: A Self-Supervised and Hybrid Network for Noisy Image Watermark Removal
Authors:
Wenyang Liu,
Jianjun Gao,
Kim-Hui Yap
Abstract:
Visible watermark removal is challenging due to its inherent complexities and the noise carried within images. Existing methods primarily rely on supervised learning approaches that require paired datasets of watermarked and watermark-free images, which are often impractical to obtain in real-world scenarios. To address this challenge, we propose SSH-Net, a Self-Supervised and Hybrid Network specifically designed for noisy image watermark removal. SSH-Net synthesizes reference watermark-free images using the watermark distribution in a self-supervised manner and adopts a dual-network design to address the task. The upper network, focused on the simpler task of noise removal, employs a lightweight CNN-based architecture, while the lower network, designed to handle the more complex task of simultaneously removing watermarks and noise, incorporates Transformer blocks to model long-range dependencies and capture intricate image features. To enhance the model's effectiveness, a shared CNN-based feature encoder is introduced before dual networks to extract common features that both networks can leverage. Our code will be available at https://github.com/wenyang001/SSH-Net.
Submitted 8 May, 2025;
originally announced May 2025.
-
Investigation of Tactile Texture Simulation on Online Shopping Experience
Authors:
Pei Hsin Lim,
Kian Meng Yap
Abstract:
With safety measures in response to the current Covid-19 pandemic, many retail clothing stores have restricted on-site fittings and shifted their business online. The inability to touch products during evaluation is an apparent limitation compared to in-store shopping, especially when the object's material information is crucial, as with clothing. Haptic technologies show potential for bridging the gap between online shops and shoppers by providing a sense of touch, yet little research has been done, especially on the effect of simulated tactile texture on the shopping experience. In this study, we modified a mock-up e-commerce website by adding clothing products and enabling a mid-air haptic interface with the Ultrahaptics Evaluation Kit (UHEV1). We developed texture sensations using Time Point Streaming (TPS) modulation for clothing products with different texture materials, and a user study was carried out to investigate the effect of tactile texture sensation on shoppers' experience in evaluating online products. Our results show that tactile texture sensation using multipoint mid-air haptic feedback improves online shoppers' satisfaction with the product browsing experience. This study contributes to improving society's general lifestyle in terms of e-commerce experience, and its application could expand to other sectors such as education and to different communities, including the visually impaired.
Submitted 7 November, 2024;
originally announced November 2024.
-
Haptic VR Simulation for Surgery Procedures in Medical Training
Authors:
Lim Zheng Jie,
Kian Meng Yap
Abstract:
Traditional medical training faces challenges like ethical concerns, safety risks, and high costs. VR technology offers a promising solution but is limited by low complexity and lack of tactile feedback. This paper presents a cost-effective haptic VR surgery simulation that simulates a realistic kidney transplant using commercial devices to enhance training authenticity and immersion. Trainees can conduct incision and anastomosis procedures using a haptic stylus device that provides tactile sensations. Results from the test with medical participants showed that haptic feedback enhances the VR medical training experience.
Submitted 7 November, 2024;
originally announced November 2024.
-
Innovative Weight Simulation in Virtual Reality Cube Games: A Pseudo-Haptic Approach
Authors:
Woan Ning Lim,
Edric Yi Junn Leong,
Yun Li Lee,
Kian Meng Yap
Abstract:
This paper presents an innovative pseudo-haptic model for weight simulation in virtual reality (VR) environments. By integrating visual feedback with voluntary exerted force through a passive haptic glove, the model creates haptic illusions of weight perception. Two VR cube games were developed to evaluate the model's effectiveness. The first game assesses participants' ability to discriminate relative weights, while the second evaluates their capability to estimate absolute weights. Twelve participants, aged 18 to 59, tested the games. Results suggest that the pseudo-haptic model is effective for relative weight discrimination tasks and holds potential for various VR applications. Further research with a larger participant group and more complex scenarios is recommended to refine and validate the model.
Submitted 7 November, 2024;
originally announced November 2024.
-
Socially Assistive Robots: A Technological Approach to Emotional Support
Authors:
Leanne Oon Hui Yee,
Siew Sui Fun,
Thit Sar Zin,
Zar Nie Aung,
Kian Meng Yap,
Jiehan Teoh
Abstract:
In today's high-pressure and isolated society, the demand for emotional support has surged, necessitating innovative solutions. Socially Assistive Robots (SARs) offer a technological approach to providing emotional assistance by leveraging advanced robotics, artificial intelligence, and sensor technologies. This study explores the development of an emotional support robot designed to detect and respond to human emotions, particularly sadness, through facial recognition and gesture analysis. Utilising the Lego Mindstorms Robotic Kit, Raspberry Pi 4, and various Python libraries, the robot is capable of delivering empathetic interactions, including comforting hugs and AI-generated conversations. Experimental findings highlight the robot's effective facial recognition accuracy, user interaction, and hug feedback mechanisms. These results demonstrate the feasibility of using SARs for emotional support, showcasing their potential features and functions. This research underscores the promise of SARs in providing innovative emotional assistance and enhancing human-robot interaction.
Submitted 7 November, 2024;
originally announced November 2024.
-
Enhancing Medical Anatomy Education through Virtual Reality (VR): Design, Development, and Evaluation
Authors:
Myint Zu Than,
Kian Meng Yap
Abstract:
Modern medicine demands innovations in medical education, particularly in the learning of human anatomy, traditionally taught through textbooks, dissections, and lectures. Virtual Reality (VR) has emerged as a promising tool to address the limitations of these conventional methods by emphasising vision-based and active learning. However, current VR educational tools are often inaccessible due to high costs and specialised equipment requirements. This paper details the design and development of an accessible, desktop-based VR system aimed at enhancing anatomy education by leveraging the user's visual perception to promote a meaningful and interactive learning experience. The Virtual Anatomy Lab was designed to enable students to interact with a 3D Skull model to complete tasks virtually via an interactive user interface (UI) with the help of common devices like a mouse and keyboard. As part of the study, a group of medical students from prestigious medical schools throughout Malaysia were invited to evaluate the built system to offer feedback and determine its overall efficiency and usability in fulfilling their learning goals. The results and findings from user evaluations were then analysed to discuss its effectiveness and areas for future improvement.
Submitted 7 November, 2024;
originally announced November 2024.
-
ByteNet: Rethinking Multimedia File Fragment Classification through Visual Perspectives
Authors:
Wenyang Liu,
Kejun Wu,
Tianyi Liu,
Yi Wang,
Kim-Hui Yap,
Lap-Pui Chau
Abstract:
Multimedia file fragment classification (MFFC) aims to identify file fragment types, e.g., image/video, audio, and text, without system metadata. It is of vital importance in multimedia storage and communication. Existing MFFC methods typically treat fragments as 1D byte sequences and emphasize the relations between separate bytes (interbytes) for classification. However, the more informative relations inside bytes (intrabytes) are overlooked and seldom investigated. By looking inside bytes, the bit-level details of file fragments can be accessed, enabling a more accurate classification. Motivated by this, we first propose Byte2Image, a novel visual representation model that incorporates previously overlooked intrabyte information into file fragments and reinterprets these fragments as 2D grayscale images. This model involves a sliding byte window to reveal the intrabyte information and a rowwise stacking of intrabyte ngrams for embedding fragments into a 2D space. Thus, complex interbyte and intrabyte correlations can be mined simultaneously using powerful vision networks. Additionally, we propose an end-to-end dual-branch network ByteNet to enhance robust correlation mining and feature representation. ByteNet makes full use of the raw 1D byte sequence and the converted 2D image through a shallow byte branch feature extraction (BBFE) and a deep image branch feature extraction (IBFE) network. In particular, the BBFE, composed of a single fully-connected layer, adaptively recognizes the co-occurrence of several specific bytes within the raw byte sequence, while the IBFE, built on a vision Transformer, effectively mines the complex interbyte and intrabyte correlations from the converted image. Experiments on the two representative benchmarks, including 14 cases, validate that our proposed method outperforms state-of-the-art approaches on different cases by up to 12.2%.
Submitted 28 October, 2024;
originally announced October 2024.
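
One way to picture the idea of exposing intrabyte (bit-level) structure as a 2D grayscale image is sketched below. This is only an illustrative interpretation and should not be read as the authors' Byte2Image implementation: the function name `bytes_to_bit_shifted_image`, the choice of eight bit offsets, and the 8-bit window are all assumptions. Each row re-reads the fragment's bitstream starting a different number of bits into every byte, and the rows are stacked.

```python
def bytes_to_bit_shifted_image(fragment):
    """Stack eight bit-shifted views of a byte fragment into a 2D grid.

    Hypothetical sketch: row `shift` holds the 8-bit values read from the
    bitstream starting `shift` bits into each byte, so values lie in 0..255.
    """
    if len(fragment) < 2:
        raise ValueError("need at least two bytes")
    rows = []
    for shift in range(8):
        row = []
        for i in range(len(fragment) - 1):
            # Join two neighbouring bytes into 16 bits, then take the
            # 8-bit window that begins `shift` bits into the first byte.
            pair = (fragment[i] << 8) | fragment[i + 1]
            row.append((pair >> (8 - shift)) & 0xFF)
        rows.append(row)
    return rows


image = bytes_to_bit_shifted_image(bytes([0xFF, 0x00, 0xFF]))
for row in image:
    print(row)
```

Row 0 reproduces the original bytes (minus the last one), while the other rows mix bits of adjacent bytes, which is the kind of cross-byte bit pattern a vision network could then pick up.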
-
CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models
Authors:
Jianjun Gao,
Chen Cai,
Ruoyu Wang,
Wenyang Liu,
Kim-Hui Yap,
Kratika Garg,
Boon-Siew Han
Abstract:
Human-object interaction (HOI) detection has seen advancements with Vision Language Models (VLMs), but these methods often depend on extensive manual annotations. Vision Large Language Models (VLLMs) can inherently recognize and reason about interactions at the image level but are computationally heavy and not designed for instance-level HOI detection. To overcome these limitations, we propose a Cross-Level HOI distillation (CL-HOI) framework, which distills instance-level HOIs from VLLMs' image-level understanding without the need for manual annotations. Our approach involves two stages: context distillation, where a Visual Linguistic Translator (VLT) converts visual information into linguistic form, and interaction distillation, where an Interaction Cognition Network (ICN) reasons about spatial, visual, and context relations. We design contrastive distillation losses to transfer image-level context and interaction knowledge from the teacher to the student model, enabling instance-level HOI detection. Evaluations on the HICO-DET and V-COCO datasets demonstrate that our CL-HOI surpasses existing weakly supervised methods and VLLM-supervised methods, showing its efficacy in detecting HOIs without manual labels.
Submitted 21 October, 2024;
originally announced October 2024.
-
Open World Object Detection: A Survey
Authors:
Yiming Li,
Yi Wang,
Wenqian Wang,
Dan Lin,
Bingbing Li,
Kim-Hui Yap
Abstract:
Exploring new knowledge is a fundamental human ability that can be mirrored in the development of deep neural networks, especially in the field of object detection. Open world object detection (OWOD) is an emerging area of research that adapts this principle to explore new knowledge. It focuses on recognizing and learning from objects absent from initial training sets, thereby incrementally expanding its knowledge base when new class labels are introduced. This survey paper offers a thorough review of the OWOD domain, covering essential aspects, including problem definitions, benchmark datasets, source codes, evaluation metrics, and a comparative study of existing methods. Additionally, we investigate related areas like open set recognition (OSR) and incremental learning (IL), underlining their relevance to OWOD. Finally, the paper concludes by addressing the limitations and challenges faced by current OWOD algorithms and proposes directions for future research. To our knowledge, this is the first comprehensive survey of the emerging OWOD field with over one hundred references, marking a significant step forward for object detection technology. Comprehensive source code and benchmarks are archived and summarized at https://github.com/ArminLee/OWOD_Review.
Submitted 28 June, 2025; v1 submitted 15 October, 2024;
originally announced October 2024.
-
Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting
Authors:
Chen Cai,
Zheng Wang,
Jianjun Gao,
Wenyang Liu,
Ye Lu,
Runzhong Zhang,
Kim-Hui Yap
Abstract:
In recent years, the rapid increase in online video content has underscored the limitations of static Video Question Answering (VideoQA) models trained on fixed datasets, as they struggle to adapt to new questions or tasks posed by newly available content. In this paper, we explore the novel challenge of VideoQA within a continual learning framework, and empirically identify a critical issue: fine-tuning a large language model (LLM) for a sequence of tasks often results in catastrophic forgetting. To address this, we propose Collaborative Prompting (ColPro), which integrates specific question constraint prompting, knowledge acquisition prompting, and visual temporal awareness prompting. These prompts aim to capture textual question context, visual content, and video temporal dynamics in VideoQA, a perspective underexplored in prior research. Experimental results on the NExT-QA and DramaQA datasets show that ColPro achieves superior performance compared to existing approaches, achieving 55.14% accuracy on NExT-QA and 71.24% accuracy on DramaQA, highlighting its practical relevance and effectiveness.
Submitted 16 January, 2025; v1 submitted 1 October, 2024;
originally announced October 2024.
-
MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition
Authors:
Ruoyu Wang,
Wenqian Wang,
Jianjun Gao,
Dan Lin,
Kim-Hui Yap,
Bingbing Li
Abstract:
Driver action recognition, aiming to accurately identify drivers' behaviours, is crucial for enhancing driver-vehicle interactions and ensuring driving safety. Unlike general action recognition, drivers' environments are often challenging, being gloomy and dark, and with the development of sensors, various cameras such as IR and depth cameras have emerged for analyzing drivers' behaviors. Therefore, in this paper, we propose a novel multimodal fusion transformer, named MultiFuser, which identifies cross-modal interrelations and interactions among multimodal car cabin videos and adaptively integrates different modalities for improved representations. Specifically, MultiFuser comprises layers of Bi-decomposed Modules to model spatiotemporal features, with a modality synthesizer for multimodal feature integration. Each Bi-decomposed Module includes a Modal Expertise ViT block for extracting modality-specific features and a Patch-wise Adaptive Fusion block for efficient cross-modal fusion. Extensive experiments are conducted on the Drive&Act dataset, and the results demonstrate the efficacy of our proposed approach.
Submitted 17 August, 2024; v1 submitted 3 August, 2024;
originally announced August 2024.
-
CM2-Net: Continual Cross-Modal Mapping Network for Driver Action Recognition
Authors:
Ruoyu Wang,
Chen Cai,
Wenqian Wang,
Jianjun Gao,
Dan Lin,
Wenyang Liu,
Kim-Hui Yap
Abstract:
Driver action recognition has significantly advanced in enhancing driver-vehicle interactions and ensuring driving safety by integrating multiple modalities, such as infrared and depth. Nevertheless, compared to RGB modality only, it is always laborious and costly to collect extensive data for all types of non-RGB modalities in car cabin environments. Therefore, previous works have suggested independently learning each non-RGB modality by fine-tuning a model pre-trained on RGB videos, but these methods are less effective in extracting informative features when faced with newly-incoming modalities due to large domain gaps. In contrast, we propose a Continual Cross-Modal Mapping Network (CM2-Net) to continually learn each newly-incoming modality with instructive prompts from the previously-learned modalities. Specifically, we have developed Accumulative Cross-modal Mapping Prompting (ACMP), to map the discriminative and informative features learned from previous modalities into the feature space of newly-incoming modalities. Then, when faced with newly-incoming modalities, these mapped features are able to provide effective prompts for which features should be extracted and prioritized. These prompts are accumulating throughout the continual learning process, thereby boosting further recognition performances. Extensive experiments conducted on the Drive&Act dataset demonstrate the performance superiority of CM2-Net on both uni- and multi-modal driver action recognition.
Submitted 3 August, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Video sentence grounding with temporally global textual knowledge
Authors:
Cai Chen,
Runzhong Zhang,
Jianjun Gao,
Kejun Wu,
Kim-Hui Yap,
Yi Wang
Abstract:
Temporal sentence grounding involves the retrieval of a video moment with a natural language query. Many existing works directly incorporate the given video and temporally localized query for temporal grounding, overlooking the inherent domain gap between different modalities. In this paper, we utilize pseudo-query features containing extensive temporally global textual knowledge sourced from the same video-query pair, to enhance the bridging of domain gaps and attain a heightened level of similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) to achieve an improved alignment of visual and comprehensive pseudo-query features within the feature space through contrastive learning. Subsequently, we utilize learnable prompts to encapsulate the knowledge of pseudo-queries, propagating them into the textual encoder and multi-modal fusion module, further enhancing the feature alignment between visual and language for better temporal grounding. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the effectiveness of our method.
Submitted 1 June, 2024; v1 submitted 21 April, 2024;
originally announced April 2024.
-
Multi-modality action recognition based on dual feature shift in vehicle cabin monitoring
Authors:
Dan Lin,
Philip Hann Yung Lee,
Yiming Li,
Ruoyu Wang,
Kim-Hui Yap,
Bingbing Li,
You Shing Ngim
Abstract:
Driver Action Recognition (DAR) is crucial in vehicle cabin monitoring systems. In real-world applications, it is common for vehicle cabins to be equipped with cameras featuring different modalities. However, multi-modality fusion strategies for the DAR task within car cabins have rarely been studied. In this paper, we propose a novel yet efficient multi-modality driver action recognition method based on dual feature shift, named DFS. DFS first integrates complementary features across modalities by performing modality feature interaction. Meanwhile, DFS achieves neighbour feature propagation within single modalities by shifting features among temporal frames. To learn common patterns and improve model efficiency, DFS shares feature extracting stages among multiple modalities. Extensive experiments have been carried out to verify the effectiveness of the proposed DFS model on the Drive&Act dataset. The results demonstrate that DFS achieves good performance and improves the efficiency of multi-modality driver action recognition.
Submitted 26 January, 2024;
originally announced January 2024.
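
Feature shifting among temporal frames can be pictured with a generic temporal shift over per-frame channel vectors. The sketch below is a common form of the technique, given under stated assumptions: the name `temporal_shift`, the `fold` parameter, and the zero padding at clip boundaries are illustrative choices and are not taken from the DFS paper, which may shift features differently.

```python
def temporal_shift(frames, fold):
    """Shift `fold` channels forward and `fold` channels backward in time.

    frames: list of T frames, each a list of C channel values (C >= 2 * fold).
    Hypothetical sketch of a generic temporal shift, not the DFS implementation.
    """
    if frames and len(frames[0]) < 2 * fold:
        raise ValueError("need at least 2 * fold channels per frame")
    t_len = len(frames)
    out = [list(frame) for frame in frames]
    for t in range(t_len):
        for c in range(fold):
            # Channels [0, fold): take the value from the previous frame.
            out[t][c] = frames[t - 1][c] if t > 0 else 0.0
        for c in range(fold, 2 * fold):
            # Channels [fold, 2 * fold): take the value from the next frame.
            out[t][c] = frames[t + 1][c] if t < t_len - 1 else 0.0
    return out


clip = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
shifted = temporal_shift(clip, fold=1)
print(shifted)
```

After the shift, each frame carries one channel from its predecessor and one from its successor while the remaining channels stay in place, so neighbouring frames exchange information without extra parameters.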
-
Octopus: A Fair Packet Delivery Service
Authors:
Junzhi Gong,
Yuliang Li,
Devdeep Ray,
KK Yap,
Nandita Dukkipati
Abstract:
Packet delivery fairness is critical in many cloud applications, such as exchange systems, consensus protocols, and online gaming applications. However, due to nonidentical and dynamic packet forwarding paths, as well as many in-network queuing delays, supporting packet delivery fairness is challenging in a shared compute environment. In this paper, we present Octopus, the first general fair packet delivery service to achieve packet arrival time variations smaller than tens of nanoseconds in the presence of latency variations in the network. The key ideas that allow Octopus to support such fairness are repurposing the hardware traffic shaping capabilities of modern NICs and deploying agents at local SmartNICs to minimize latency variations from packet forwarding. Evaluation results show that Octopus has less than 40 ns unfairness for up to 99.97% of multicast packets.
Submitted 16 January, 2024;
originally announced January 2024.
-
Learning-Based Biharmonic Augmentation for Point Cloud Classification
Authors:
Jiacheng Wei,
Guosheng Lin,
Henghui Ding,
Jie Hu,
Kim-Hui Yap
Abstract:
Point cloud datasets often suffer from inadequate sample sizes in comparison to image datasets, making data augmentation challenging. While traditional methods, like rigid transformations and scaling, have limited potential in increasing dataset diversity due to their constraints on altering individual sample shapes, we introduce the Biharmonic Augmentation (BA) method. BA is a novel and efficient data augmentation technique that diversifies point cloud data by imposing smooth non-rigid deformations on existing 3D structures. This approach calculates biharmonic coordinates for the deformation function and learns diverse deformation prototypes. Utilizing a CoefNet, our method predicts coefficients to amalgamate these prototypes, ensuring comprehensive deformation. Moreover, we present AdvTune, an advanced online augmentation system that integrates adversarial training. This system synergistically refines the CoefNet and the classification network, facilitating the automated creation of adaptive shape deformations contingent on the learner status. Comprehensive experimental analysis validates the superiority of Biharmonic Augmentation, showcasing notable performance improvements over prevailing point cloud augmentation techniques across varied network designs.
Submitted 10 November, 2023;
originally announced November 2023.
-
Bitstream-Corrupted Video Recovery: A Novel Benchmark Dataset and Method
Authors:
Tianyi Liu,
Kejun Wu,
Yi Wang,
Wenyang Liu,
Kim-Hui Yap,
Lap-Pui Chau
Abstract:
The past decade has witnessed great strides in video recovery by specialist technologies, like video inpainting, completion, and error concealment. However, they typically simulate the missing content by manually designed error masks, thus failing to fill in the realistic video loss in video communication (e.g., telepresence, live streaming, and internet video) and multimedia forensics. To address this, we introduce the bitstream-corrupted video (BSCV) benchmark, the first benchmark dataset with more than 28,000 video clips, which can be used for bitstream-corrupted video recovery in the real world. The BSCV is a collection of 1) a proposed three-parameter corruption model for video bitstream, 2) a large-scale dataset containing rich error patterns, multiple corruption levels, and flexible dataset branches, and 3) a plug-and-play module in a video recovery framework that serves as a benchmark. We evaluate state-of-the-art video inpainting methods on the BSCV dataset, demonstrating existing approaches' limitations and our framework's advantages in solving the bitstream-corrupted video recovery problem. The benchmark and dataset are released at https://github.com/LIUTIGHE/BSCV-Dataset.
Submitted 26 September, 2023; v1 submitted 25 September, 2023;
originally announced September 2023.
-
OccluTrack: Rethinking Awareness of Occlusion for Enhancing Multiple Pedestrian Tracking
Authors:
Jianjun Gao,
Yi Wang,
Kim-Hui Yap,
Kratika Garg,
Boon Siew Han
Abstract:
Multiple pedestrian tracking is crucial for enhancing safety and efficiency in intelligent transport and autonomous driving systems by predicting movements and enabling adaptive decision-making in dynamic environments. It optimizes traffic flow, facilitates human interaction, and ensures compliance with regulations. However, it faces the challenge of tracking pedestrians in the presence of occlusion. Existing methods overlook effects caused by abnormal detections during partial occlusion. Subsequently, these abnormal detections can lead to inaccurate motion estimation, unreliable appearance features, and unfair association. To address these issues, we propose an adaptive occlusion-aware multiple pedestrian tracker, OccluTrack, to mitigate the effects caused by partial occlusion. Specifically, we first introduce a plug-and-play abnormal motion suppression mechanism into the Kalman Filter to adaptively detect and suppress outlier motions caused by partial occlusion. Second, we develop a pose-guided re-identification (Re-ID) module to extract discriminative part features for partially occluded pedestrians. Last, we develop a new occlusion-aware association method towards fair Intersection over Union (IoU) and appearance embedding distance measurement for occluded pedestrians. Extensive evaluation results demonstrate that our method outperforms state-of-the-art methods on MOTChallenge and DanceTrack datasets. Particularly, the performance improvements on IDF1 and ID Switches, as well as visualized results, demonstrate the effectiveness of our method in multiple pedestrian tracking.
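For readers unfamiliar with the IoU term used in the association step, the following is a minimal sketch of the plain Intersection over Union between two axis-aligned boxes. It is not the occlusion-aware variant proposed in the paper, whose details the abstract does not give; the function name `iou` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Plain IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Illustrative only: the box layout is an assumption, and this is the
    standard IoU, not the paper's occlusion-aware measurement.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = area(A) + area(B) - intersection.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so `iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7.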
Submitted 26 April, 2025; v1 submitted 19 September, 2023;
originally announced September 2023.
-
Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection
Authors:
Kairui Hu,
Ming Yan,
Joey Tianyi Zhou,
Ivor W. Tsang,
Wen Haw Chong,
Yong Keong Yap
Abstract:
Stance detection aims to identify the attitude expressed in a document towards a given target. Techniques such as Chain-of-Thought (CoT) prompting have advanced this task, enhancing a model's reasoning capabilities through the derivation of intermediate rationales. However, CoT relies primarily on a model's pre-trained internal knowledge during reasoning, thereby neglecting the valuable external information that is previously unknown to the model. This omission, especially within the unsupervised reasoning process, can affect the model's overall performance. Moreover, while CoT enhances Large Language Models (LLMs), smaller LMs, though efficient operationally, face challenges in delivering nuanced reasoning. In response to these identified gaps, we introduce the Ladder-of-Thought (LoT) for the stance detection task. Constructed through a dual-phase Progressive Optimization Framework, LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced. These bolstered rationales subsequently serve as the foundation for more precise predictions - akin to how a ladder facilitates reaching elevated goals. LoT achieves a balance between efficiency and performance. Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on the stance detection task.
Submitted 7 September, 2023; v1 submitted 31 August, 2023;
originally announced August 2023.
-
Top-Down Framework for Weakly-supervised Grounded Image Captioning
Authors:
Chen Cai,
Suchen Wang,
Kim-hui Yap,
Yi Wang
Abstract:
Weakly-supervised grounded image captioning (WSGIC) aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision. Recent two-stage solutions mostly apply a bottom-up pipeline: (1) encode the input image into multiple region features using an object detector; (2) leverage region features for captioning and grounding. However, utilizing independent proposals produced by object detectors tends to make the subsequent grounded captioner overfitted in finding the correct object words, overlooking the relation between objects, and selecting incompatible proposal regions for grounding. To address these issues, we propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level. Specifically, we encode the image into visual token representations and propose a Recurrent Grounding Module (RGM) in the decoder to obtain precise Visual Language Attention Maps (VLAMs), which recognize the spatial locations of the objects. In addition, we explicitly inject a relation module into our one-stage framework to encourage relation understanding through multi-label classification. These relation semantics serve as contextual information, facilitating the prediction of relation and object words in the caption. We observe that the relation semantics not only assist the grounded captioner in generating a more accurate caption but also improve the grounding performance. We validate the effectiveness of our proposed method on two challenging datasets (Flickr30k Entities captioning and MSCOCO captioning). The experimental results demonstrate that our method achieves state-of-the-art grounding performance.
Submitted 2 March, 2024; v1 submitted 12 June, 2023;
originally announced June 2023.
-
Guiding Computational Stance Detection with Expanded Stance Triangle Framework
Authors:
Zhengyuan Liu,
Yong Keong Yap,
Hai Leong Chieu,
Nancy F. Chen
Abstract:
Stance detection determines whether the author of a piece of text is in favor of, against, or neutral towards a specified target, and can be used to gain valuable insights into social media. The ubiquitous indirect referral of targets makes this task challenging, as it requires computational solutions to model semantic features and infer the corresponding implications from a literal statement. Moreover, the limited amount of available training data leads to subpar performance in out-of-domain and cross-target scenarios, as data-driven approaches are prone to rely on superficial and domain-specific features. In this work, we decompose the stance detection task from a linguistic perspective, and investigate key components and inference paths in this task. The stance triangle is a generic linguistic framework previously proposed to describe the fundamental ways people express their stance. We further expand it by characterizing the relationship between explicit and implicit objects. We then use the framework to extend one single training corpus with additional annotation. Experimental results show that strategically-enriched data can significantly improve the performance on out-of-domain and cross-target evaluation.
Submitted 31 May, 2023;
originally announced May 2023.
-
SSN: Stockwell Scattering Network for SAR Image Change Detection
Authors:
Gong Chen,
Yanan Zhao,
Yi Wang,
Kim-Hui Yap
Abstract:
Recently, synthetic aperture radar (SAR) image change detection has become an interesting yet challenging direction due to the presence of speckle noise. Although both traditional and modern learning-driven methods have attempted to overcome this challenge, deep convolutional neural network (DCNN)-based methods are still hindered by the lack of interpretability and the requirement of large computation power. To overcome this drawback, the wavelet scattering network (WSN) and the Fourier scattering network (FSN) have been proposed. Combining the respective merits of WSN and FSN, we propose the Stockwell scattering network (SSN) based on the Stockwell transform, which is widely applied against noisy signals and shows advantageous characteristics in speckle reduction. The proposed SSN provides noise-resilient feature representation and obtains state-of-the-art performance in SAR image change detection as well as high computational efficiency. Experimental results on three real SAR image datasets demonstrate the effectiveness of the proposed method.
Submitted 22 April, 2023;
originally announced April 2023.
-
A Byte Sequence is Worth an Image: CNN for File Fragment Classification Using Bit Shift and n-Gram Embeddings
Authors:
Wenyang Liu,
Yi Wang,
Kejun Wu,
Kim-Hui Yap,
Lap-Pui Chau
Abstract:
File fragment classification (FFC) on small chunks of memory is essential in memory forensics and Internet security. Existing methods mainly treat file fragments as 1d byte signals and utilize the captured inter-byte features for classification, while the bit information within bytes, i.e., intra-byte information, is seldom considered. This is inherently inapt for classifying variable-length coding files whose symbols are represented by a variable number of bits. Conversely, we propose Byte2Image, a novel data augmentation technique, to introduce the neglected intra-byte information into file fragments and re-treat them as 2d gray-scale images, which allows us to capture both inter-byte and intra-byte correlations simultaneously through powerful convolutional neural networks (CNNs). Specifically, to convert file fragments to 2d images, we employ a sliding byte window to expose the neglected intra-byte information and stack their n-gram features row by row. We further propose a byte sequence & image fusion network as a classifier, which can jointly model the raw 1d byte sequence and the converted 2d image to perform FFC. Experiments on the FFT-75 dataset validate that our proposed method can achieve notable accuracy improvements over state-of-the-art methods in nearly all scenarios. The code will be released at https://github.com/wenyang001/Byte2Image.
Submitted 14 April, 2023;
originally announced April 2023.
-
Bitstream-Corrupted JPEG Images are Restorable: Two-stage Compensation and Alignment Framework for Image Restoration
Authors:
Wenyang Liu,
Yi Wang,
Kim-Hui Yap,
Lap-Pui Chau
Abstract:
In this paper, we study a real-world JPEG image restoration problem with bit errors on the encrypted bitstream. The bit errors bring unpredictable color casts and block shifts on decoded image contents, which cannot be resolved by existing image restoration methods mainly relying on pre-defined degradation models in the pixel domain. To address these challenges, we propose a robust JPEG decoder, followed by a two-stage compensation and alignment framework to restore bitstream-corrupted JPEG images. Specifically, the robust JPEG decoder adopts an error-resilient mechanism to decode the corrupted JPEG bitstream. The two-stage framework is composed of the self-compensation and alignment (SCA) stage and the guided-compensation and alignment (GCA) stage. The SCA adaptively performs block-wise image color compensation and alignment based on the estimated color and block offsets via image content similarity. The GCA leverages the extracted low-resolution thumbnail from the JPEG header to guide full-resolution pixel-wise image restoration in a coarse-to-fine manner. It is achieved by a coarse-guided pix2pix network and a refine-guided bi-directional Laplacian pyramid fusion network. We conduct experiments on three benchmarks with varying degrees of bit error rates. Experimental results and ablation studies demonstrate the superiority of our proposed method. The code will be released at https://github.com/wenyang001/Two-ACIR.
Submitted 14 April, 2023;
originally announced April 2023.
-
TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision
Authors:
Jiacheng Wei,
Hao Wang,
Jiashi Feng,
Guosheng Lin,
Kim-Hui Yap
Abstract:
In this paper, we investigate an open research task of generating controllable 3D textured shapes from the given textual descriptions. Previous works either require ground truth caption labeling or extensive optimization time. To resolve these issues, we present a novel framework, TAPS3D, to train a text-guided 3D shape generator with pseudo captions. Specifically, based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates. Our constructed captions provide high-level semantic supervision for generated 3D shapes. Further, in order to produce fine-grained textures and increase geometry diversity, we propose to adopt low-level image regularization to enable fake-rendered images to align with the real ones. During the inference phase, our proposed model can generate 3D textured shapes from the given text without any additional optimization. We conduct extensive experiments to analyze each of our proposed components and show the efficacy of our framework in generating high-fidelity 3D textured and text-relevant shapes.
Submitted 23 March, 2023;
originally announced March 2023.
-
Dense Supervision Propagation for Weakly Supervised Semantic Segmentation on 3D Point Clouds
Authors:
Jiacheng Wei,
Guosheng Lin,
Kim-Hui Yap,
Fayao Liu,
Tzu-Yi Hung
Abstract:
Semantic segmentation on 3D point clouds is an important task for 3D scene understanding. While dense labeling on 3D data is expensive and time-consuming, only a few works address weakly supervised semantic point cloud segmentation methods to relieve the labeling cost by learning from simpler and cheaper labels. Meanwhile, there are still huge performance gaps between existing weakly supervised methods and state-of-the-art fully supervised methods. In this paper, we train a semantic point cloud segmentation network with only a small portion of points being labeled. We argue that we can better utilize the limited supervision information as we densely propagate the supervision signal from the labeled points to other points within and across the input samples. Specifically, we propose a cross-sample feature reallocating module to transfer similar features and therefore re-route the gradients across two samples with common classes and an intra-sample feature redistribution module to propagate supervision signals on unlabeled points across and within point cloud samples. We conduct extensive experiments on public datasets S3DIS and ScanNet. Our weakly supervised method with only 10% and 1% of labels can produce results comparable to the fully supervised counterpart.
Submitted 1 April, 2024; v1 submitted 23 July, 2021;
originally announced July 2021.
-
Reconciliation of Statistical and Spatial Sparsity For Robust Image and Image-Set Classification
Authors:
Hao Cheng,
Kim-Hui Yap,
Bihan Wen
Abstract:
Recent image classification algorithms, by learning deep features from large-scale datasets, have achieved significantly better results compared to the classic feature-based approaches. However, there are still various challenges of image classification in practice, such as classifying noisy image or image-set queries and training deep image classification models over limited-scale datasets. Instead of applying generic deep features, the model-based approaches can be more effective and data-efficient for robust image and image-set classification tasks, as various image priors are exploited for modeling the inter- and intra-set data variations while preventing over-fitting. In this work, we propose a novel Joint Statistical and Spatial Sparse representation, dubbed J3S, to model the image or image-set data for classification, by reconciling both their local patch structures and global Gaussian distribution mapped into a Riemannian manifold. To the best of our knowledge, no work to date has utilized both global statistics and local patch structures jointly via joint sparse representation. We propose to solve the joint sparse coding problem based on the J3S model, by coupling the local and global image representations using joint sparsity. The learned J3S models are used for robust image and image-set classification. Experiments show that the proposed J3S-based image classification scheme outperforms the popular or state-of-the-art competing methods over FMD, UIUC, ETH-80 and YTC databases.
Submitted 1 June, 2021;
originally announced June 2021.
-
The D-plus Discriminant and Complexity of Root Clustering
Authors:
Jing Yang,
Chee K. Yap
Abstract:
Let $p(x)$ be an integer polynomial with $m\ge 2$ distinct roots $\rho_1,\ldots,\rho_m$ whose multiplicities are $\boldsymbol{\mu}=(\mu_1,\ldots,\mu_m)$. We define the D-plus discriminant of $p(x)$ to be $D^+(p):= \prod_{1\le i<j\le m}(\rho_i-\rho_j)^{\mu_i+\mu_j}$. We first prove a conjecture that $D^+(p)$ is a $\boldsymbol{\mu}$-symmetric function of its roots $\rho_1,\ldots,\rho_m$. Our main result gives an explicit formula for $D^+(p)$, as a rational function of its coefficients. Our proof is ideal-theoretic, based on re-casting the classic Poisson resultant as the "symbolic Poisson formula". The D-plus discriminant first arose in the complexity analysis of a root clustering algorithm from Becker et al. (ISSAC 2016). The bit-complexity of this algorithm is proportional to a quantity $\log(|D^+(p)|^{-1})$. As an application of our main result, we give an explicit upper bound on this quantity in terms of the degree of $p$ and its leading coefficient.
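As a worked illustration of the definition (the polynomial below is our own example and does not come from the paper), take $p(x)=(x-1)^2(x-2)(x-4)=x^4-8x^3+21x^2-22x+8$, so that the distinct roots are $(\rho_1,\rho_2,\rho_3)=(1,2,4)$ with multiplicities $\boldsymbol{\mu}=(2,1,1)$. Then

$$D^+(p)=(\rho_1-\rho_2)^{\mu_1+\mu_2}\,(\rho_1-\rho_3)^{\mu_1+\mu_3}\,(\rho_2-\rho_3)^{\mu_2+\mu_3}=(-1)^{3}\,(-3)^{3}\,(-2)^{2}=108.$$

Exchanging the two simple roots $\rho_2$ and $\rho_3$ gives $(-3)^{3}\,(-1)^{3}\,(2)^{2}=108$, the same value.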
Submitted 19 May, 2021; v1 submitted 9 May, 2021;
originally announced May 2021.
-
Empirical Analysis of Overfitting and Mode Drop in GAN Training
Authors:
Yasin Yazici,
Chuan-Sheng Foo,
Stefan Winkler,
Kim-Hui Yap,
Vijay Chandrasekhar
Abstract:
We examine two key questions in GAN training, namely overfitting and mode drop, from an empirical perspective. We show that when stochasticity is removed from the training procedure, GANs can overfit and exhibit almost no mode drop. Our results shed light on important characteristics of the GAN training procedure. They also provide evidence against prevailing intuitions that GANs do not memorize the training set, and that mode dropping is mainly due to properties of the GAN objective rather than how it is optimized during training.
Submitted 25 June, 2020;
originally announced June 2020.
-
Arnold: an eFPGA-Augmented RISC-V SoC for Flexible and Low-Power IoT End-Nodes
Authors:
Pasquale Davide Schiavone,
Davide Rossi,
Alfio Di Mauro,
Frank Gurkaynak,
Timothy Saxe,
Mao Wang,
Ket Chong Yap,
Luca Benini
Abstract:
A wide range of Internet of Things (IoT) applications require powerful, energy-efficient and flexible end-nodes to acquire data from multiple sources, process and distill the sensed data through near-sensor data analytics algorithms, and transmit it wirelessly. This work presents Arnold: a 0.5 V to 0.8 V, 46.83 uW/MHz, 600 MOPS fully programmable RISC-V Microcontroller unit (MCU) fabricated in 22 nm Globalfoundries GF22FDX technology, which couples a state-of-the-art (SoA) microcontroller to an embedded Field Programmable Gate Array (FPGA). We demonstrate the flexibility of the System-on-Chip (SoC) to tackle the challenges of many emerging IoT applications, such as (i) interfacing sensors and accelerators with non-standard interfaces, (ii) performing on-the-fly pre-processing tasks on data streamed from peripherals, and (iii) accelerating near-sensor analytics, encryption, and machine learning tasks. A unique feature of the proposed SoC is the exploitation of body-biasing to reduce leakage power of the embedded FPGA (eFPGA) fabric by up to 18x at 0.5 V, achieving SoA state bitstream-retentive sleep power for the eFPGA fabric, as low as 20.5 uW. The proposed SoC provides 3.4x better performance and 2.9x better energy efficiency than other fabricated heterogeneous re-configurable SoCs of the same class.
Submitted 25 June, 2020;
originally announced June 2020.
-
Multi-Path Region Mining For Weakly Supervised 3D Semantic Segmentation on Point Clouds
Authors:
Jiacheng Wei,
Guosheng Lin,
Kim-Hui Yap,
Tzu-Yi Hung,
Lihua Xie
Abstract:
Point clouds provide intrinsic geometric information and surface context for scene understanding. Existing methods for point cloud segmentation require a large amount of fully labeled data. With advanced depth sensors, collecting large-scale 3D datasets is no longer a cumbersome process. However, manually producing point-level labels on large-scale datasets is time- and labor-intensive. In this paper, we propose a weakly supervised approach to predict point-level results using weak labels on 3D point clouds. We introduce our multi-path region mining module to generate pseudo point-level labels from a classification network trained with weak labels. It mines the localization cues for each class from various aspects of the network feature using different attention modules. Then, we use the point-level pseudo labels to train a point cloud segmentation network in a fully supervised manner. To the best of our knowledge, this is the first method that uses cloud-level weak labels on raw 3D space to train a point cloud semantic segmentation network. In our setting, the 3D weak labels only indicate the classes that appeared in our input sample. We discuss both scene- and subcloud-level weak labels on raw 3D point cloud data and perform in-depth experiments on them. On the ScanNet dataset, our result trained with subcloud-level labels is comparable to some fully supervised methods.
Submitted 29 March, 2020;
originally announced March 2020.
-
On mu-Symmetric Polynomials
Authors:
Jing Yang,
Chee K. Yap
Abstract:
In this paper, we study functions of the roots of a univariate polynomial in which the roots have a given multiplicity structure $\mu$. Traditionally, root functions are studied via the theory of symmetric polynomials; we extend this theory to $\mu$-symmetric polynomials. We were motivated by a conjecture from Becker et al. (ISSAC 2016) about the $\mu$-symmetry of a particular root function $D^+(\mu)$, called D-plus. To investigate this conjecture, it was desirable to have fast algorithms for checking if a given root function is $\mu$-symmetric. We designed three such algorithms: one based on Gröbner bases, another based on preprocessing and reduction, and the third based on solving linear equations. We implemented them in Maple and experiments show that the latter two algorithms are significantly faster than the first.
Submitted 21 January, 2020;
originally announced January 2020.
-
AANet: Attribute Attention Network for Person Re-Identifications
Authors:
Chiat-Pin Tay,
Sharmili Roy,
Kim-Hui Yap
Abstract:
This paper proposes Attribute Attention Network (AANet), a new architecture that integrates person attributes and attribute attention maps into a classification framework to solve the person re-identification (re-ID) problem. Many person re-ID models typically employ semantic cues such as body parts or human pose to improve the re-ID performance. Attribute information, however, is often not utilized. The proposed AANet leverages a baseline model that uses body parts and integrates the key attribute information in a unified learning framework. The AANet consists of a global person ID task, a part detection task and a crucial attribute detection task. By estimating the class responses of individual attributes and combining them to form the attribute attention map (AAM), a very strong discriminatory representation is constructed. The proposed AANet outperforms the best state-of-the-art method arXiv:1711.09349v3 [cs.CV] using ResNet-50 by 3.36% in mAP and 3.12% in Rank-1 accuracy on the DukeMTMC-reID dataset. On the Market1501 dataset, AANet achieves 92.38% mAP and 95.10% Rank-1 accuracy with re-ranking, outperforming arXiv:1804.00216v1 [cs.CV], another state-of-the-art method using ResNet-152, by 1.42% in mAP and 0.47% in Rank-1 accuracy. In addition, AANet can perform person attribute prediction (e.g., gender, hair length, clothing length, etc.), and localize the attributes in the query image.
Submitted 19 December, 2019;
originally announced December 2019.
-
Semantic Granularity Metric Learning for Visual Search
Authors:
Dipu Manandhar,
Muhammet Bastan,
Kim-Hui Yap
Abstract:
Deep metric learning applied to various applications has shown promising results in identification, retrieval and recognition. Existing methods often do not consider different granularities in visual similarity. However, in many domain applications, images exhibit similarity at multiple granularities with visual semantic concepts, e.g. fashion demonstrates similarity ranging from clothing of the exact same instance to similar looks/design or a common category. Therefore, the training image triplets/pairs used for metric learning inherently carry different degrees of information. However, existing methods often treat them with equal importance during training. This hinders capturing the underlying granularities in feature similarity required for effective visual search.
In view of this, we propose a new deep semantic granularity metric learning (SGML) method that develops a novel idea of leveraging the attribute semantic space to capture different granularities of similarity and then integrates this information into deep metric learning. The proposed method simultaneously learns image attributes and embeddings using multitask CNNs. The two tasks are not only jointly optimized but are further linked by the semantic granularity similarity mappings to leverage the correlations between the tasks. To this end, we propose a new soft-binomial deviance loss that effectively integrates the degree of information in training samples, which helps to capture visual similarity at multiple granularities. Compared to recent ensemble-based methods, our framework is conceptually elegant, computationally simple and provides better performance. We perform extensive experiments on benchmark metric learning datasets and demonstrate that our method outperforms recent state-of-the-art methods, e.g., a 1-4.5% improvement in Recall@1 over the previous state of the art [1],[2] on the DeepFashion In-Shop dataset.
Submitted 14 November, 2019;
originally announced November 2019.
-
Venn GAN: Discovering Commonalities and Particularities of Multiple Distributions
Authors:
Yasin Yazıcı,
Bruno Lecouat,
Chuan-Sheng Foo,
Stefan Winkler,
Kim-Hui Yap,
Georgios Piliouras,
Vijay Chandrasekhar
Abstract:
We propose a GAN design which models multiple distributions effectively and discovers their commonalities and particularities. Each data distribution is modeled with a mixture of $K$ generator distributions. As the generators are partially shared between the modeling of different true data distributions, the shared ones capture the commonality of the distributions, while the non-shared ones capture unique aspects of them. We show the effectiveness of our method on various datasets (MNIST, Fashion MNIST, CIFAR-10, Omniglot, CelebA) with compelling results.
Submitted 9 February, 2019;
originally announced February 2019.
-
Interest Point Detection based on Adaptive Ternary Coding
Authors:
Zhenwei Miao,
Kim-Hui Yap,
Xudong Jiang
Abstract:
In this paper, an adaptive pixel ternary coding mechanism is proposed and a contrast-invariant and noise-resistant interest point detector is developed on the basis of this mechanism. Every pixel in a local region is adaptively encoded into one of three statuses: bright, uncertain and dark. The blob significance of the local region is measured by the spatial distribution of the bright and dark pixels. Interest points are extracted from this blob significance measurement. By labeling the ternary statuses of bright, uncertain, and dark, the proposed detector shows more robustness to image noise and quantization errors. Moreover, the adaptive strategy for the ternary coding, which relies on two thresholds that automatically converge to the median of the local region in measurement, enables this coding to be insensitive to the local image contrast. As a result, the proposed detector is invariant to illumination changes. State-of-the-art results are achieved on the standard datasets, and also in the face recognition application.
Submitted 31 December, 2018;
originally announced January 2019.
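The three-status encoding described in the abstract above can be illustrated with a small sketch. This is not the paper's detector: the abstract does not give the threshold update rule, so the sketch assumes two thresholds placed symmetrically around the local median, with a margin set to a fraction `k` of the median absolute deviation. The function name `ternary_code`, the parameter `k`, and the MAD-based margin are all hypothetical choices made here for illustration only.

```python
from statistics import median

def ternary_code(patch, k=0.5):
    """Encode each pixel of a 2D patch as +1 (bright), 0 (uncertain) or -1 (dark).

    Assumption (not from the paper): the two thresholds sit at
    median +/- k * MAD, where MAD is the median absolute deviation
    of the patch. This makes the codes unchanged under v -> a*v + b
    with a > 0, a simple stand-in for contrast insensitivity.
    """
    flat = [v for row in patch for v in row]
    m = median(flat)
    mad = median(abs(v - m) for v in flat)
    lo, hi = m - k * mad, m + k * mad
    return [[1 if v > hi else -1 if v < lo else 0 for v in row] for row in patch]

patch = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]
# Same patch after a gain and offset change (a = 2, b = 5).
relit = [[2 * v + 5 for v in row] for row in patch]

codes = ternary_code(patch)
codes_relit = ternary_code(relit)
```

Under these assumptions the two code maps are identical, which is the kind of behavior the abstract attributes to thresholds that track the local median.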
-
DCI: Discriminative and Contrast Invertible Descriptor
Authors:
Zhenwei Miao,
Kim-Hui Yap,
Xudong Jiang,
Subbhuraam Sinduja,
Zhenhua Wang
Abstract:
Local feature descriptors have been widely used in fine-grained visual object search thanks to their robustness to scale and rotation variation and cluttered backgrounds. However, the performance of such descriptors drops under severe illumination changes. In this paper, we propose a Discriminative and Contrast Invertible (DCI) local feature descriptor. In order to increase the discriminative ability of the descriptor under illumination changes, a Laplace gradient based histogram is proposed. A robust contrast flipping estimate is proposed based on the divergence of a local region. Experiments on fine-grained object recognition and retrieval applications demonstrate the superior performance of the DCI descriptor over others.
Submitted 31 December, 2018;
originally announced January 2019.
-
The Unusual Effectiveness of Averaging in GAN Training
Authors:
Yasin Yazıcı,
Chuan-Sheng Foo,
Stefan Winkler,
Kim-Hui Yap,
Georgios Piliouras,
Vijay Chandrasekhar
Abstract:
We examine two different techniques for parameter averaging in GAN training. Moving Average (MA) computes the time-average of parameters, whereas Exponential Moving Average (EMA) computes an exponentially discounted sum. Whilst MA is known to lead to convergence in bilinear settings, we provide the -- to our knowledge -- first theoretical arguments in support of EMA. We show that EMA converges to limit cycles around the equilibrium with vanishing amplitude as the discount parameter approaches one for simple bilinear games and also enhances the stability of general GAN training. We establish experimentally that both techniques are strikingly effective in the non-convex-concave GAN setting as well. Both improve inception and FID scores on different architectures and for different GAN objectives. We provide comprehensive experimental results across a range of datasets -- mixture of Gaussians, CIFAR-10, STL-10, CelebA and ImageNet -- to demonstrate its effectiveness. We achieve state-of-the-art results on CIFAR-10 and produce clean CelebA face images. The code is available at https://github.com/yasinyazici/EMA_GAN.
Submitted 26 February, 2019; v1 submitted 12 June, 2018;
originally announced June 2018.
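The two averaging schemes named in the abstract above can be written down in a few lines. The sketch below is a minimal illustration, not the authors' released code: the helper names, the discount value `beta`, and the toy "rotating" parameter trajectory (a stand-in for iterates cycling around an equilibrium at the origin) are assumptions made here.

```python
import math

def moving_average(avg, params, step):
    """Uniform time-average of all parameters seen so far (step counts from 1)."""
    return [a + (p - a) / step for a, p in zip(avg, params)]

def exponential_moving_average(avg, params, beta=0.999):
    """Exponentially discounted average with discount parameter beta."""
    return [beta * a + (1.0 - beta) * p for a, p in zip(avg, params)]

def rotating_iterates(n, omega=0.1):
    """Toy trajectory (assumption): parameters circling the origin at unit radius."""
    return [[math.cos(omega * t), math.sin(omega * t)] for t in range(1, n + 1)]

def run(n=2000, beta=0.99):
    ma = [0.0, 0.0]
    ema = [0.0, 0.0]
    for t, theta in enumerate(rotating_iterates(n), start=1):
        ma = moving_average(ma, theta, t)
        ema = exponential_moving_average(ema, theta, beta)
    return ma, ema

ma, ema = run()
ma_norm = math.hypot(*ma)    # distance of the MA iterate from the origin
ema_norm = math.hypot(*ema)  # distance of the EMA iterate from the origin
```

On this toy trajectory the raw iterates stay at distance 1 from the origin, while both averaged iterates end up much closer to it. In a training loop, the averaged copy would typically be kept alongside the trained parameters and used only for evaluation.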
-
Remote Detection of Idling Cars Using Infrared Imaging and Deep Networks
Authors:
Muhammet Bastan,
Kim-Hui Yap,
Lap-Pui Chau
Abstract:
Idling vehicles waste energy and pollute the environment through exhaust emission. In some countries, idling a vehicle for more than a predefined duration is prohibited and automatic idling vehicle detection is desirable for law enforcement. We propose the first automatic system to detect idling cars, using infrared (IR) imaging and deep networks.
We rely on the differences in spatio-temporal heat signatures of idling and stopped cars and monitor the car temperature with a long-wavelength IR camera. We formulate the idling car detection problem as spatio-temporal event detection in IR image sequences and employ deep networks for spatio-temporal modeling. We collected the first IR image sequence dataset for idling car detection. First, we detect the cars in each IR image using a convolutional neural network, which is pre-trained on regular RGB images and fine-tuned on IR images for higher accuracy. Then, we track the detected cars over time to identify the cars that are parked. Finally, we use the 3D spatio-temporal IR image volume of each parked car as input to convolutional and recurrent networks to classify them as idling or not. We carried out an extensive empirical evaluation of temporal and spatio-temporal modeling approaches with various convolutional and recurrent architectures. We present promising experimental results on our IR image sequence dataset.
Submitted 28 April, 2018;
originally announced April 2018.
-
Handling state space explosion in verification of component-based systems: A review
Authors:
Faranak Nejati,
Abdul Azim Abd. Ghani,
Ng Keng Yap,
Azmi Jaafar
Abstract:
Component-based software development (CBSD) is an alternative approach to constructing software systems that offers numerous benefits, particularly in decreasing the complexity of system design. However, deploying components into a system is a challenging and error-prone task. Model-checking is one of the reliable methods to systematically analyze the correctness of a system. It is a brute-force check of the system's state space that helps to significantly increase the level of confidence in the system. Nevertheless, model-checking is limited by a critical problem called state-space explosion (SSE). To benefit from model-checking, an appropriate method is required to reduce SSE. In the past two decades, a great number of SSE reduction methods have been proposed, containing many similarities, dissimilarities, and, in some cases, unclear concepts. This research, first, presents a review of SSE handling methods and classifies them based on their similarities, principles, and characteristics. Second, it investigates the methods for handling the SSE problem in the verification process of CBSD and provides insight into the potential limitations, underlining the key challenges for future research efforts.
Submitted 26 May, 2021; v1 submitted 28 July, 2017;
originally announced September 2017.
-
Resolution-Exact Planner for Thick Non-Crossing 2-Link Robots
Authors:
Chee K. Yap,
Zhongdi Luo,
Ching-Hsiang Hsu
Abstract:
We consider the path planning problem for a 2-link robot amidst polygonal obstacles. Our robot is parametrizable by the lengths $\ell_1, \ell_2>0$ of its two links, the thickness $τ\ge 0$ of the links, and an angle $κ$ that constrains the angle between the 2 links to be strictly greater than $κ$. The case $τ>0$ and $κ\ge 0$ corresponds to "thick non-crossing" robots. This results in a novel 4DOF configuration space ${\mathbb R}^2\times ({\mathbb T}\setminusΔ(κ))$ where ${\mathbb T}$ is the torus and $Δ(κ)$ the diagonal band of width $κ$. We design a resolution-exact planner for this robot using the framework of Soft Subdivision Search (SSS). First, we provide an analysis of the space of forbidden angles, leading to a soft predicate for classifying configuration boxes. We further exploit the T/R splitting technique which was previously introduced for self-crossing thin 2-link robots. Our open-source implementation in Core Library achieves real-time performance for a suite of combinatorially non-trivial obstacle sets. Experimentally, our algorithm is significantly better than any of the state-of-the-art sampling algorithms we looked at, both in timing and in success rate.
Submitted 17 April, 2017;
originally announced April 2017.
-
Certified Computation of planar Morse-Smale Complexes
Authors:
Amit Chattopadhyay,
Gert Vegter,
Chee K. Yap
Abstract:
The Morse-Smale complex is an important tool for global topological analysis in various problems of computational geometry and topology. Algorithms for Morse-Smale complexes have been presented in the case of piecewise linear manifolds. However, previous research in this field is incomplete in the case of smooth functions. In the current paper we address the following question: Given an arbitrarily complex Morse-Smale system on a planar domain, is it possible to compute its certified (topologically correct) Morse-Smale complex? Towards this, we develop an algorithm using interval arithmetic to compute certified critical points and separatrices forming the Morse-Smale complexes of smooth functions on a bounded planar domain. Our algorithm can also compute geometrically close Morse-Smale complexes.
Submitted 20 June, 2015;
originally announced June 2015.