-
EMG-Driven Stiffness-Modulating Palpation for Telerehabilitation
Authors:
Thomas M. Kwok,
Hilary HY Cheng,
Wai Tuck Chow
Abstract:
In this work, we introduce HJ-Pal, a lightweight wearable haptic device that leverages EMG-driven honeycomb jamming to render muscle activation as kinesthetic feedback, enabling remote palpation for small muscle assessment in telerehabilitation.
Submitted 9 June, 2025;
originally announced June 2025.
-
Time-Dependent Precision Measurement of $B_s^0\rightarrow \phi\mu^+\mu^-$ Decay at FCC-$ee$
Authors:
Tsz Hong Kwok,
Zachary Polonsky,
Valeriia Lukashenko,
Jason Aebischer,
Ben Kilminster
Abstract:
We study the feasibility of measuring time-dependent $C\!P$ violation in the rare flavor-changing neutral current (FCNC) decay $B_s^0 \rightarrow \phi(\rightarrow K^+K^-) \mu^+ \mu^-$ at the FCC-$ee$. In the Standard Model (SM), $C\!P$ violation in this mode arises only at higher orders and is highly suppressed. Extensions of the SM, collectively referred to as New Physics (NP), can introduce additional $C\!P$-violating phases that enhance such effects. The decay $B_s^0 \rightarrow \phi\mu^+ \mu^-$, mediated by the $b \rightarrow s \ell^+ \ell^-$ transition, is therefore a promising probe of NP. The FCC-$ee$, operating as a high-luminosity $Z$-factory, offers an optimal environment for this measurement due to its large event yield, clean conditions, efficient particle identification, and excellent vertex resolution. We perform a Monte Carlo study using Pythia and Delphes with the IDEA detector concept. A relative precision better than $\mathcal{O}(1\%)$ on the branching ratio and $\mathcal{O}(10^{-2})$ on the time-integrated $C\!P$ asymmetry is found to be achievable. We determine the projected sensitivities to the observables $D_f$, $C_f$, and $S_f$, which parameterize time-dependent $C\!P$ violation. In the untagged analysis, a precision of $\mathcal{O}(10^{-1})$ on $D_f$ can be reached. With flavor tagging, sensitivities to $C_f$ and $S_f$ improve to $\mathcal{O}(10^{-2})$. These measurements remain inaccessible to current flavor experiments. Interpreting the results within the Weak Effective Theory provides model-independent constraints on $C\!P$-violating NP. This study demonstrates that FCC-$ee$ enables first-time access to $C\!P$-sensitive observables previously beyond experimental reach.
Submitted 9 June, 2025;
originally announced June 2025.
-
DynamicMind: A Tri-Mode Thinking System for Large Language Models
Authors:
Wei Li,
Yanbin Wei,
Qiushi Huang,
Jiangyue Yan,
Yang Chen,
James T. Kwok,
Yu Zhang
Abstract:
Modern large language models (LLMs) often struggle to dynamically adapt their reasoning depth to varying task complexities, leading to suboptimal performance or inefficient resource utilization. To address this, we introduce DynamicMind, a novel tri-mode thinking system. DynamicMind empowers LLMs to autonomously select between Fast, Normal, and Slow thinking modes for zero-shot question answering (ZSQA) tasks through cognitive-inspired prompt engineering. Our framework's core innovations include: (1) expanding the established dual-process framework of fast and slow thinking into a tri-mode thinking system involving a normal thinking mode to preserve the intrinsic capabilities of LLM; (2) proposing the Thinking Density metric, which aligns computational resource allocation with problem complexity; and (3) developing the Thinking Mode Capacity (TMC) dataset and a lightweight Mind Router to predict the optimal thinking mode. Extensive experiments across diverse mathematical, commonsense, and scientific QA benchmarks demonstrate that DynamicMind achieves superior ZSQA capabilities while establishing an effective trade-off between performance and computational efficiency.
Submitted 6 June, 2025;
originally announced June 2025.
-
Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning
Authors:
Yunhao Gou,
Kai Chen,
Zhili Liu,
Lanqing Hong,
Xin Jin,
Zhenguo Li,
James T. Kwok,
Yu Zhang
Abstract:
Recent advances in slow-thinking language models (e.g., OpenAI-o1 and DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks by emulating human-like reflective cognition. However, extending such capabilities to multi-modal large language models (MLLMs) remains challenging due to the high cost of retraining vision-language alignments when upgrading the underlying reasoner LLMs. A straightforward solution is to decouple perception from reasoning, i.e., converting visual inputs into language representations (e.g., captions) that are then passed to a powerful text-only reasoner. However, this decoupling introduces a critical challenge: the visual extractor must generate descriptions that are both faithful to the image and informative enough to support accurate downstream reasoning. To address this, we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward Optimization (RACRO) - a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective. By closing the perception-reasoning loop via reward-based optimization, RACRO significantly enhances visual grounding and extracts reasoning-optimized representations. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance while enabling superior scalability and plug-and-play adaptation to more advanced reasoning LLMs without the necessity for costly multi-modal re-alignment.
Submitted 4 June, 2025;
originally announced June 2025.
-
Multi-Order Wavelet Derivative Transform for Deep Time Series Forecasting
Authors:
Ziyu Zhou,
Jiaxi Hu,
Qingsong Wen,
James T. Kwok,
Yuxuan Liang
Abstract:
In deep time series forecasting, the Fourier Transform (FT) is extensively employed for frequency representation learning. However, it often struggles in capturing multi-scale, time-sensitive patterns. Although the Wavelet Transform (WT) can capture these patterns through frequency decomposition, its coefficients are insensitive to change points in time series, leading to suboptimal modeling. To mitigate these limitations, we introduce the multi-order Wavelet Derivative Transform (WDT) grounded in the WT, enabling the extraction of time-aware patterns spanning both the overall trend and subtle fluctuations. Compared with the standard FT and WT, which model the raw series, the WDT operates on the derivative of the series, selectively magnifying rate-of-change cues and exposing abrupt regime shifts that are particularly informative for time series modeling. Practically, we embed the WDT into a multi-branch framework named WaveTS, which decomposes the input series into multi-scale time-frequency coefficients, refines them via linear layers, and reconstructs them into the time domain via the inverse WDT. Extensive experiments on ten benchmark datasets demonstrate that WaveTS achieves state-of-the-art forecasting accuracy while retaining high computational efficiency.
Submitted 16 May, 2025;
originally announced May 2025.
-
Positioning Monocular Optical See Through Head Worn Displays in Glasses for Everyday Wear
Authors:
Parth Arora,
Ethan Kimmel,
Katherine Huang,
Tyler Kwok,
Yukun Song,
Sofia Vempala,
Georgianna Lin,
Ozan Cakmakci,
Thad Starner
Abstract:
Head-worn displays for everyday wear in the form of regular eyeglasses are technically feasible with recent advances in waveguide technology. One major design decision is determining where in the user's visual field to position the display. Centering the display in the principal point of gaze (PPOG) allows the user to switch attentional focus between the virtual and real images quickly, and best performance often occurs when the display is centered in PPOG or is centered vertically below PPOG. However, these positions are often undesirable in that they are considered interruptive or are associated with negative social perceptions by users. Offsetting the virtual image may be preferred when tasks involve driving, walking, or social interaction. This paper consolidates findings from recent studies on monocular optical see-through HWDs (OST-HWDs), focusing on potential for interruption, comfort, performance, and social perception. For text-based tasks, which serve as a proxy for many monocular OST-HWD tasks, we recommend a 15° horizontal field of view (FOV) with the virtual image in the right lens vertically centered but offset to +8.7° to +23.7° toward the ear. Glanceable content can be offset up to +30° for short interactions.
Submitted 13 May, 2025;
originally announced May 2025.
-
Future Circular Collider Feasibility Study Report: Volume 2, Accelerators, Technical Infrastructure and Safety
Authors:
M. Benedikt,
F. Zimmermann,
B. Auchmann,
W. Bartmann,
J. P. Burnet,
C. Carli,
A. Chancé,
P. Craievich,
M. Giovannozzi,
C. Grojean,
J. Gutleber,
K. Hanke,
A. Henriques,
P. Janot,
C. Lourenço,
M. Mangano,
T. Otto,
J. Poole,
S. Rajagopalan,
T. Raubenheimer,
E. Todesco,
L. Ulrici,
T. Watson,
G. Wilkinson,
A. Abada
, et al. (1439 additional authors not shown)
Abstract:
In response to the 2020 Update of the European Strategy for Particle Physics, the Future Circular Collider (FCC) Feasibility Study was launched as an international collaboration hosted by CERN. This report describes the FCC integrated programme, which consists of two stages: an electron-positron collider (FCC-ee) in the first phase, serving as a high-luminosity Higgs, top, and electroweak factory; followed by a proton-proton collider (FCC-hh) at the energy frontier in the second phase.
FCC-ee is designed to operate at four key centre-of-mass energies: the Z pole, the WW production threshold, the ZH production peak, and the top/anti-top production threshold - delivering the highest possible luminosities to four experiments. Over 15 years of operation, FCC-ee will produce more than 6 trillion Z bosons, 200 million WW pairs, nearly 3 million Higgs bosons, and 2 million top anti-top pairs. Precise energy calibration at the Z pole and WW threshold will be achieved through frequent resonant depolarisation of pilot bunches. The sequence of operation modes remains flexible.
FCC-hh will operate at a centre-of-mass energy of approximately 85 TeV - nearly an order of magnitude higher than the LHC - and is designed to deliver 5 to 10 times the integrated luminosity of the HL-LHC. Its mass reach for direct discovery extends to several tens of TeV. In addition to proton-proton collisions, FCC-hh is capable of supporting ion-ion, ion-proton, and lepton-hadron collision modes.
This second volume of the Feasibility Study Report presents the complete design of the FCC-ee collider, its operation and staging strategy, the full-energy booster and injector complex, required accelerator technologies, safety concepts, and technical infrastructure. It also includes the design of the FCC-hh hadron collider, development of high-field magnets, hadron injector options, and key technical systems for FCC-hh.
Submitted 25 April, 2025;
originally announced May 2025.
-
Future Circular Collider Feasibility Study Report: Volume 3, Civil Engineering, Implementation and Sustainability
Authors:
M. Benedikt,
F. Zimmermann,
B. Auchmann,
W. Bartmann,
J. P. Burnet,
C. Carli,
A. Chancé,
P. Craievich,
M. Giovannozzi,
C. Grojean,
J. Gutleber,
K. Hanke,
A. Henriques,
P. Janot,
C. Lourenço,
M. Mangano,
T. Otto,
J. Poole,
S. Rajagopalan,
T. Raubenheimer,
E. Todesco,
L. Ulrici,
T. Watson,
G. Wilkinson,
P. Azzi
, et al. (1439 additional authors not shown)
Abstract:
Volume 3 of the FCC Feasibility Report presents studies related to civil engineering, the development of a project implementation scenario, and environmental and sustainability aspects. The report details the iterative improvements made to the civil engineering concepts since 2018, taking into account subsurface conditions, accelerator and experiment requirements, and territorial considerations. It outlines a technically feasible and economically viable civil engineering configuration that serves as the baseline for detailed subsurface investigations, construction design, cost estimation, and project implementation planning. Additionally, the report highlights ongoing subsurface investigations in key areas to support the development of an improved 3D subsurface model of the region.
The report describes development of the project scenario based on the 'avoid-reduce-compensate' iterative optimisation approach. The reference scenario balances optimal physics performance with territorial compatibility, implementation risks, and costs. Environmental field investigations covering almost 600 hectares of terrain - including numerous urban, economic, social, and technical aspects - confirmed the project's technical feasibility and contributed to the preparation of essential input documents for the formal project authorisation phase. The summary also highlights the initiation of public dialogue as part of the authorisation process. The results of a comprehensive socio-economic impact assessment, which included significant environmental effects, are presented. Even under the most conservative and stringent conditions, a positive benefit-cost ratio for the FCC-ee is obtained. Finally, the report provides a concise summary of the studies conducted to document the current state of the environment.
Submitted 25 April, 2025;
originally announced May 2025.
-
Future Circular Collider Feasibility Study Report: Volume 1, Physics, Experiments, Detectors
Authors:
M. Benedikt,
F. Zimmermann,
B. Auchmann,
W. Bartmann,
J. P. Burnet,
C. Carli,
A. Chancé,
P. Craievich,
M. Giovannozzi,
C. Grojean,
J. Gutleber,
K. Hanke,
A. Henriques,
P. Janot,
C. Lourenço,
M. Mangano,
T. Otto,
J. Poole,
S. Rajagopalan,
T. Raubenheimer,
E. Todesco,
L. Ulrici,
T. Watson,
G. Wilkinson,
P. Azzi
, et al. (1439 additional authors not shown)
Abstract:
Volume 1 of the FCC Feasibility Report presents an overview of the physics case, experimental programme, and detector concepts for the Future Circular Collider (FCC). This volume outlines how FCC would address some of the most profound open questions in particle physics, from precision studies of the Higgs and EW bosons and of the top quark, to the exploration of physics beyond the Standard Model. The report reviews the experimental opportunities offered by the staged implementation of FCC, beginning with an electron-positron collider (FCC-ee), operating at several centre-of-mass energies, followed by a hadron collider (FCC-hh). Benchmark examples are given of the expected physics performance, in terms of precision and sensitivity to new phenomena, of each collider stage. Detector requirements and conceptual designs for FCC-ee experiments are discussed, as are the specific demands that the physics programme imposes on the accelerator in the domains of the calibration of the collision energy, and the interface region between the accelerator and the detector. The report also highlights advances in detector, software and computing technologies, as well as the theoretical tools/reconstruction techniques that will enable the precision measurements and discovery potential of the FCC experimental programme. This volume reflects the outcome of a global collaborative effort involving hundreds of scientists and institutions, aided by a dedicated community-building coordination, and provides a targeted assessment of the scientific opportunities and experimental foundations of the FCC programme.
Submitted 25 April, 2025;
originally announced May 2025.
-
A Survey on Small Sample Imbalance Problem: Metrics, Feature Analysis, and Solutions
Authors:
Shuxian Zhao,
Jie Gui,
Minjing Dong,
Baosheng Yu,
Zhipeng Gui,
Lu Dong,
Yuan Yan Tang,
James Tin-Yau Kwok
Abstract:
The small sample imbalance (S&I) problem is a major challenge in machine learning and data analysis. It is characterized by a small number of samples and an imbalanced class distribution, which leads to poor model performance. In addition, indistinct inter-class feature distributions further complicate classification tasks. Existing methods often rely on algorithmic heuristics without sufficiently analyzing the underlying data characteristics. We argue that a detailed analysis from the data perspective is essential before developing an appropriate solution. Therefore, this paper proposes a systematic analytical framework for the S&I problem. We first summarize imbalance metrics and complexity analysis methods, highlighting the need for interpretable benchmarks to characterize S&I problems. Second, we review recent solutions for conventional, complexity-based, and extreme S&I problems, revealing methodological differences in handling various data distributions. Our summary finds that resampling remains a widely adopted solution. However, we conduct experiments on binary and multiclass datasets, revealing that classifier performance differences significantly exceed the improvements achieved through resampling. Finally, this paper highlights open questions and discusses future trends.
Submitted 20 April, 2025;
originally announced April 2025.
-
FinSage: A Multi-aspect RAG System for Financial Filings Question Answering
Authors:
Xinyu Wang,
Jijun Chi,
Zhenghan Tai,
Tung Sum Thomas Kwok,
Muzhi Li,
Zhuhong Li,
Hailin He,
Yuchen Hua,
Peng Lu,
Suyuchen Wang,
Yihong Wu,
Jerry Huang,
Jingrui Tian,
Fengran Mo,
Yufei Cui,
Ling Zhou
Abstract:
Leveraging large language models in real-world settings often requires domain-specific data and tools in order to comply with the complex regulations that govern acceptable use. Within financial sectors, modern enterprises increasingly rely on Retrieval-Augmented Generation (RAG) systems to address complex compliance requirements in financial document workflows. However, existing solutions struggle to account for the inherent heterogeneity of data (e.g., text, tables, diagrams) and the evolving nature of regulatory standards used in financial filings, leading to compromised accuracy in critical information extraction. We propose the FinSage framework as a solution, a multi-aspect RAG framework tailored for regulatory compliance analysis in multi-modal financial documents. FinSage introduces three innovative components: (1) a multi-modal pre-processing pipeline that unifies diverse data formats and generates chunk-level metadata summaries, (2) a multi-path sparse-dense retrieval system augmented with query expansion (HyDE) and metadata-aware semantic search, and (3) a domain-specialized re-ranking module fine-tuned via Direct Preference Optimization (DPO) to prioritize compliance-critical content. Extensive experiments demonstrate that FinSage achieves an impressive recall of 92.51% on 75 expert-curated questions and surpasses the best baseline method on the FinanceBench question answering datasets by 24.06% in accuracy. Moreover, FinSage has been successfully deployed as a financial question-answering agent in online meetings, where it has already served more than 1,200 people.
Submitted 6 June, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
ColorVein: Colorful Cancelable Vein Biometrics
Authors:
Yifan Wang,
Jie Gui,
Xinli Shi,
Linqing Gui,
Yuan Yan Tang,
James Tin-Yau Kwok
Abstract:
Vein recognition technologies have become one of the primary solutions for high-security identification systems. However, the issue of biometric information leakage can still pose a serious threat to user privacy and anonymity. Currently, there is no cancelable biometric template generation scheme specifically designed for vein biometrics. Therefore, this paper proposes an innovative cancelable vein biometric generation scheme: ColorVein. Unlike previous cancelable template generation schemes, ColorVein does not destroy the original biometric features and introduces additional color information to grayscale vein images. This method significantly enhances the information density of vein images by transforming static grayscale information into dynamically controllable color representations through interactive colorization. ColorVein allows users/administrators to define a controllable pseudo-random color space for grayscale vein images by editing the position, number, and color of hint points, thereby generating protected cancelable templates. Additionally, we propose a new secure center loss to optimize the training process of the protected feature extraction model, effectively increasing the feature distance between enrolled users and any potential impostors. Finally, we evaluate ColorVein's performance on all types of vein biometrics, including recognition performance, unlinkability, irreversibility, and revocability, and conduct security and privacy analyses. ColorVein achieves competitive performance compared with state-of-the-art methods.
Submitted 19 April, 2025;
originally announced April 2025.
-
Leveraging GCN-based Action Recognition for Teleoperation in Daily Activity Assistance
Authors:
Thomas M. Kwok,
Jiaan Li,
Yue Hu
Abstract:
Caregiving of older adults is an urgent global challenge, with many older adults preferring to age in place rather than enter residential care. However, providing adequate home-based assistance remains difficult, particularly in geographically vast regions. Teleoperated robots offer a promising solution, but conventional motion-mapping teleoperation imposes unnatural movement constraints on operators, leading to muscle fatigue and reduced usability. This paper presents a novel teleoperation framework that leverages action recognition to enable intuitive remote robot control. Using our simplified Spatio-Temporal Graph Convolutional Network (S-ST-GCN), the system recognizes human actions and executes corresponding preset robot trajectories, eliminating the need for direct motion synchronization. A finite-state machine (FSM) is integrated to enhance reliability by filtering out misclassified actions. Our experiments demonstrate that the proposed framework enables effortless operator movement while ensuring accurate robot execution. This proof-of-concept study highlights the potential of teleoperation with action recognition for enabling caregivers to remotely assist older adults during activities of daily living (ADLs). Future work will focus on improving the S-ST-GCN's recognition accuracy and generalization, integrating advanced motion planning techniques to further enhance robotic autonomy in older adult care, and conducting a user study to evaluate the system's telepresence and ease of control.
Submitted 9 April, 2025;
originally announced April 2025.
-
GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction
Authors:
Tung Sum Thomas Kwok,
Chi-Hua Wang,
Guang Cheng
Abstract:
Tabular data synthesis involves not only multi-table synthesis but also generating multi-modal data (e.g., strings and categories), which enables diverse knowledge synthesis. However, separating numerical and categorical data has limited the effectiveness of tabular data generation. The GReaT (Generate Realistic Tabular Data) framework uses Large Language Models (LLMs) to encode entire rows, eliminating the need to partition data types. Despite this, the framework's performance is constrained by two issues: (1) tabular data entries lack sufficient semantic meaning, limiting LLM's ability to leverage pre-trained knowledge for in-context learning, and (2) complex multi-table datasets struggle to establish effective relationships for collaboration. To address these, we propose GReaTER (Generate Realistic Tabular Data after data Enhancement and Reduction), which includes: (1) a data semantic enhancement system that improves LLM's understanding of tabular data through mapping, enabling better in-context learning, and (2) a cross-table connecting method to establish efficient relationships across complex tables. Experimental results show that GReaTER outperforms the GReaT framework.
Submitted 19 March, 2025;
originally announced March 2025.
-
Robot Character Generation and Adaptive Human-Robot Interaction with Personality Shaping
Authors:
Cheng Tang,
Chao Tang,
Steven Gong,
Thomas M. Kwok,
Yue Hu
Abstract:
We present a novel framework for designing emotionally agile robots with dynamic personalities and memory-based learning, with the aim of performing adaptive and non-deterministic interactions with humans while conforming to shared social understanding. While existing work has largely focused on emotion recognition and static response systems, many approaches rely on sentiment analysis and action mapping frameworks that are pre-defined with limited dimensionality and fixed configurations, lacking the flexibility of dynamic personality traits and memory-enabled adaptation. Other systems are often restricted to limited modes of expression and fail to develop a causal relationship between human behavior and the robot's proactive physical actions, resulting in constrained adaptability and reduced responsiveness in complex, dynamic interactions. Our methodology integrates the Big Five Personality Traits, Appraisal Theory, and abstracted memory layers through Large Language Models (LLMs). The LLM generates a parameterized robot personality based on the Big Five, processes human language and sentiments, evaluates human behavior using Appraisal Theory, and generates emotions and selects appropriate actions adapted by historical context over time. We validated the framework by testing three robots with distinct personalities in identical background contexts and found that personality, appraisal, and memory influence the adaptability of human-robot interactions. The impact of the individual components was further validated through ablation tests. We conclude that this system enables robots to engage in meaningful and personalized interactions with users, and holds significant potential for applications in domains such as pet robots, assistive robots, educational robots, and collaborative functional robots, where cultivating tailored relationships and enriching user experiences are essential.
Submitted 21 March, 2025; v1 submitted 2 February, 2025;
originally announced March 2025.
-
A Practical Sensing Interface for Exoskeleton Evaluation in Workplaces using Interface Forces
Authors:
Joshua Leong Wei Ren,
Thomas M. Kwok
Abstract:
This paper presents a novel approach to evaluating back support exoskeletons (BSEs) in workplace settings, addressing the limitations of traditional methods like electromyography (EMG), which are impractical due to their sensitivity to external disturbances and user sweat. Variability in BSE performance among users, often due to joint misalignment and anthropometric differences, can lead to discomfort and reduced effectiveness. To overcome these challenges, we propose integrating a compact load cell into the exoskeleton's thigh cuff. This small load cell provides precise force measurements without significantly altering the exoskeleton's kinematics or inertia, enabling real-time assessment of exoskeleton assistance in both laboratory and workplace environments. Experimental validation during load-lifting tasks demonstrated that the load cell effectively captures interface forces between the BSE and human subjects, showing stronger correlations with the user's muscle activity when the BSE provides effective assistance. This innovative sensing interface offers a stable, practical alternative to EMG and respiratory gas measurements, facilitating more accurate and convenient evaluation of BSE performance in real-world industrial and laboratory settings. The proposed method holds promise for enhancing the adoption and effectiveness of BSEs by providing reliable, real-time feedback on their assistance capabilities.
Submitted 28 February, 2025;
originally announced March 2025.
-
Corrupted but Not Broken: Understanding and Mitigating the Negative Impacts of Corrupted Data in Visual Instruction Tuning
Authors:
Yunhao Gou,
Hansi Yang,
Zhili Liu,
Kai Chen,
Yihan Zeng,
Lanqing Hong,
Zhenguo Li,
Qun Liu,
Bo Han,
James T. Kwok,
Yu Zhang
Abstract:
Visual Instruction Tuning (VIT) aims to enhance Multimodal Large Language Models (MLLMs), yet its effectiveness is often compromised by corrupted datasets with issues such as hallucinated content, incorrect responses, and poor OCR quality. Previous approaches to address these challenges have focused on refining datasets through high-quality data collection or rule-based filtering that can be costly or limited in scope. In this paper, we conduct a systematic investigation into the impact of corrupted data on MLLMs and discover that, although corrupted data degrade model performance, such adverse effects are largely reversible, and MLLMs are corrupted but not broken. Specifically, we find that disabling a small subset of parameters can almost fully restore performance. Moreover, corrupted MLLMs inherently possess the capability to differentiate between clean and corrupted samples, facilitating dataset cleaning without external intervention. Building on these insights, we introduce a corruption-robust training paradigm that significantly surpasses existing strategies for mitigating the effects of corrupted data.
Submitted 27 May, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
Gradient-Based Multi-Objective Deep Learning: Algorithms, Theories, Applications, and Beyond
Authors:
Weiyu Chen,
Xiaoyuan Zhang,
Baijiong Lin,
Xi Lin,
Han Zhao,
Qingfu Zhang,
James T. Kwok
Abstract:
Multi-objective optimization (MOO) in deep learning aims to simultaneously optimize multiple conflicting objectives, a challenge frequently encountered in areas like multi-task learning and multi-criteria learning. Recent advancements in gradient-based MOO methods have enabled the discovery of diverse types of solutions, ranging from a single balanced solution to finite or even infinite Pareto sets, tailored to user needs. These developments have broad applications across domains such as reinforcement learning, computer vision, recommendation systems, and large language models. This survey provides the first comprehensive review of gradient-based MOO in deep learning, covering algorithms, theories, and practical applications. By unifying various approaches and identifying critical challenges, it serves as a foundational resource for driving innovation in this evolving field. A comprehensive list of MOO algorithms in deep learning is available at https://github.com/Baijiong-Lin/Awesome-Multi-Objective-Deep-Learning.
Submitted 3 March, 2025; v1 submitted 18 January, 2025;
originally announced January 2025.
-
Detecting Neurocognitive Disorders through Analyses of Topic Evolution and Cross-modal Consistency in Visual-Stimulated Narratives
Authors:
Jinchao Li,
Yuejiao Wang,
Junan Li,
Jiawen Kang,
Bo Zheng,
Simon Wong,
Brian Mak,
Helene Fung,
Jean Woo,
Man-Wai Mak,
Timothy Kwok,
Vincent Mok,
Xianmin Gong,
Xixin Wu,
Xunying Liu,
Patrick Wong,
Helen Meng
Abstract:
Early detection of neurocognitive disorders (NCDs) is crucial for timely intervention and disease management. Speech analysis offers a non-intrusive and scalable screening method, particularly through narrative tasks in neuropsychological assessment tools. Traditional narrative analysis often focuses on local indicators in microstructure, such as word usage and syntax. While these features provide insights into language production abilities, they often fail to capture global narrative patterns, or macrostructures. Macrostructures include coherence, thematic organization, and logical progressions, reflecting essential cognitive skills potentially critical for recognizing NCDs. Addressing this gap, we propose to investigate specific cognitive and linguistic challenges by analyzing topical shifts, temporal dynamics, and the coherence of narratives over time, aiming to reveal cognitive deficits by identifying narrative impairments, and exploring their impact on communication and cognition. The investigation is based on the CU-MARVEL Rabbit Story corpus, which comprises recordings of a story-telling task from 758 older adults. We developed two approaches: the Dynamic Topic Models (DTM)-based temporal analysis to examine the evolution of topics over time, and the Text-Image Temporal Alignment Network (TITAN) to evaluate the coherence between spoken narratives and visual stimuli. The DTM-based approach validated the effectiveness of dynamic topic consistency as a macrostructural metric (F1=0.61, AUC=0.78). The TITAN approach achieved the highest performance (F1=0.72, AUC=0.81), surpassing established microstructural and macrostructural feature sets. Cross-comparison and regression tasks further demonstrated the effectiveness of the proposed dynamic macrostructural modeling approaches for NCD detection.
Submitted 7 January, 2025;
originally announced January 2025.
-
Analogue Forecast System for Daily Precipitation Prediction Using Autoencoder Feature Extraction: Application in Hong Kong
Authors:
Yee Chun Tsoi,
Yu Ting Kwok,
Ming Chun Lam,
Wai Kin Wong
Abstract:
In the Hong Kong Observatory, the Analogue Forecast System (AFS) for precipitation has been providing useful reference in predicting possible daily rainfall scenarios for the next 9 days, by identifying historical cases with similar weather patterns to the latest output from the deterministic model of the European Centre for Medium-Range Weather Forecasts (ECMWF). Recent advances in machine learning allow more sophisticated models to be trained using historical data and the patterns of high-impact weather events to be represented more effectively. As such, an enhanced AFS has been developed using the deep learning technique autoencoder. The datasets of the fifth generation of the ECMWF Reanalysis (ERA5) are utilised where more meteorological elements in higher horizontal, vertical and temporal resolutions are available as compared to the previous ECMWF reanalysis products used in the existing AFS. The enhanced AFS features four major steps in generating the daily rain class forecasts: (1) preprocessing of gridded ERA5 and ECMWF model forecast, (2) feature extraction by the pretrained autoencoder, (3) application of optimised feature weightings based on historical cases, and (4) calculation of the final rain class from a weighted ensemble of top analogues. The enhanced AFS demonstrates a consistent and superior performance over the existing AFS, especially in capturing heavy rain cases, during the verification period from 2019 to 2022. This paper presents the detailed formulation of the enhanced AFS and discusses its advantages and limitations in supporting precipitation forecasting in Hong Kong.
Submitted 6 January, 2025;
originally announced January 2025.
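As a rough illustration of step (4) in the abstract above, the sketch below derives a rain class from a weighted ensemble of the top analogues. The similarity-proportional weighting, the rounding to the nearest class, the function name `ensemble_rain_class`, and the sample values are all assumptions made for illustration; the abstract does not specify how the ensemble is weighted.

```python
from typing import List, Tuple


def ensemble_rain_class(analogues: List[Tuple[float, int]], top_k: int = 3) -> int:
    """Combine the rain classes of the most similar historical cases.

    analogues: (similarity, rain_class) pairs, one per historical case.
    top_k: number of top analogues to keep (hypothetical default).
    """
    if not analogues:
        raise ValueError("at least one analogue is required")
    # Keep the top_k cases with the highest similarity to the forecast.
    top = sorted(analogues, key=lambda a: a[0], reverse=True)[:top_k]
    total = sum(sim for sim, _ in top)
    if total <= 0:
        raise ValueError("similarities must sum to a positive value")
    # Similarity-weighted mean of the rain classes (assumed weighting scheme).
    mean_class = sum(sim * cls for sim, cls in top) / total
    return round(mean_class)


# Hypothetical analogues: (similarity to the latest forecast, observed rain class).
cases = [(0.9, 2), (0.8, 3), (0.1, 0), (0.7, 3)]
print(ensemble_rain_class(cases, top_k=3))
```

The low-similarity case is dropped before averaging, so a poorly matching day cannot pull the ensemble toward its rain class.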
-
Flavor Physics at CEPC: a General Perspective
Authors:
Xiaocong Ai,
Wolfgang Altmannshofer,
Peter Athron,
Xiaozhi Bai,
Lorenzo Calibbi,
Lu Cao,
Yuzhi Che,
Chunhui Chen,
Ji-Yuan Chen,
Long Chen,
Mingshui Chen,
Shanzhen Chen,
Xuan Chen,
Shan Cheng,
Cheng-Wei Chiang,
Andreas Crivellin,
Hanhua Cui,
Olivier Deschamps,
Sébastien Descotes-Genon,
Xiaokang Du,
Shuangshi Fang,
Yu Gao,
Li-Sheng Geng,
Pablo Goldenzweig,
Jiayin Gu
, et al. (116 additional authors not shown)
Abstract:
We discuss the landscape of flavor physics at the Circular Electron-Positron Collider (CEPC), based on the nominal luminosity outlined in its Technical Design Report. The CEPC is designed to operate in multiple modes to address a variety of tasks. At the $Z$ pole, the expected production of 4 Tera $Z$ bosons will provide unique and highly precise measurements of $Z$ boson couplings, while the substantial number of boosted heavy-flavored quarks and leptons produced in clean $Z$ decays will facilitate investigations into their flavor physics with unprecedented precision. We investigate the prospects of measuring various physics benchmarks and discuss their implications for particle theories and phenomenological models. Our studies indicate that, with its highlighted advantages and anticipated excellent detector performance, the CEPC can explore beauty and $τ$ physics in ways that are superior to or complementary with the Belle II and Large-Hadron-Collider-beauty experiments, potentially enabling the detection of new physics at energy scales of 10 TeV and above. This potential also extends to the observation of yet-to-be-discovered rare and exotic processes, as well as testing fundamental principles such as lepton flavor universality and lepton and baryon number conservation, making the CEPC a vibrant platform for flavor physics research. The $WW$ threshold scan, Higgs-factory operation and top-pair production of the CEPC further enhance its merits in this regard, especially for measuring the Cabibbo-Kobayashi-Maskawa matrix elements and the Flavor-Changing-Neutral-Current physics of the Higgs boson and top quark. We outline the requirements for detector performance and considerations for future development to achieve the anticipated scientific goals.
Submitted 31 December, 2024; v1 submitted 27 December, 2024;
originally announced December 2024.
-
DEREC-SIMPRO: unlock Language Model benefits to advance Synthesis in Data Clean Room
Authors:
Tung Sum Thomas Kwok,
Chi-hua Wang,
Guang Cheng
Abstract:
Data collaboration via Data Clean Room offers value but raises privacy concerns, which can be addressed through synthetic data and multi-table synthesizers. Common multi-table synthesizers fail to perform when subjects occur repeatedly in both tables. This is an urgent yet unresolved problem, since having both tables with repeating subjects is common. To improve performance in this scenario, we present the DEREC 3-step pre-processing pipeline to generalize adaptability of multi-table synthesizers. We also introduce the SIMPRO 3-aspect evaluation metrics, which leverage conditional distribution and large-scale simultaneous hypothesis testing to provide comprehensive feedback on synthetic data fidelity at both column and table levels. Results show that using DEREC improves fidelity, and multi-table synthesizers outperform single-table counterparts in collaboration settings. Together, the DEREC-SIMPRO pipeline offers a robust solution for generalizing data collaboration, promoting a more efficient, data-driven society.
Submitted 31 October, 2024;
originally announced November 2024.
-
RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models
Authors:
Shuhao Chen,
Weisen Jiang,
Baijiong Lin,
James T. Kwok,
Yu Zhang
Abstract:
Recent works show that assembling multiple off-the-shelf large language models (LLMs) can harness their complementary abilities. To achieve this, routing is a promising method, which learns a router to select the most suitable LLM for each query. However, existing routing models are ineffective when multiple LLMs perform well for a query. To address this problem, in this paper, we propose a method called query-based Router by Dual Contrastive learning (RouterDC). The RouterDC model consists of an encoder and LLM embeddings, and we propose two contrastive learning losses to train the RouterDC model. Experimental results show that RouterDC is effective in assembling LLMs and largely outperforms individual top-performing LLMs as well as existing routing methods on both in-distribution (+2.76%) and out-of-distribution (+1.90%) tasks. Source code is available at https://github.com/shuhao02/RouterDC.
Submitted 29 September, 2024;
originally announced September 2024.
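To make the routing idea in the abstract above concrete, the sketch below picks an LLM for a query by comparing a query embedding against one embedding per LLM. The use of cosine similarity with an argmax, the function names `cosine` and `route`, and the toy two-dimensional embeddings are assumptions for illustration only; the encoder and the two contrastive training losses are not shown.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def route(query_emb: List[float], llm_embs: Dict[str, List[float]]) -> str:
    """Return the name of the LLM whose embedding best matches the query."""
    return max(llm_embs, key=lambda name: cosine(query_emb, llm_embs[name]))


# Hypothetical embeddings; in practice the query embedding would come
# from a trained encoder and the LLM embeddings would be learned.
llm_embs = {"llm_a": [0.9, 0.1], "llm_b": [0.0, 1.0]}
print(route([1.0, 0.0], llm_embs))
```

Keeping one learned vector per LLM means that adding a candidate model only adds one entry to `llm_embs`, under the assumptions above.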
-
Underwater Organism Color Enhancement via Color Code Decomposition, Adaptation and Interpolation
Authors:
Xiaofeng Cong,
Jing Zhang,
Yeying Jin,
Junming Hou,
Yu Zhao,
Jie Gui,
James Tin-Yau Kwok,
Yuan Yan Tang
Abstract:
Underwater images often suffer from quality degradation due to absorption and scattering effects. Most existing underwater image enhancement algorithms produce a single, fixed-color image, limiting user flexibility and application. To address this limitation, we propose a method called ColorCode, which enhances underwater images while offering a range of controllable color outputs. Our approach involves recovering an underwater image to a reference enhanced image through supervised training and decomposing it into color and content codes via self-reconstruction and cross-reconstruction. The color code is explicitly constrained to follow a Gaussian distribution, allowing for efficient sampling and interpolation during inference. ColorCode offers three key features: 1) color enhancement, producing an enhanced image with a fixed color; 2) color adaptation, enabling controllable adjustments of long-wavelength color components using guidance images; and 3) color interpolation, allowing for the smooth generation of multiple colors through continuous sampling of the color code. Quantitative and visual evaluations on popular and challenging benchmark datasets demonstrate the superiority of ColorCode over existing methods in providing diverse, controllable, and color-realistic enhancement results. The source code is available at https://github.com/Xiaofeng-life/ColorCode.
Submitted 29 September, 2024;
originally announced September 2024.
-
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Authors:
Kai Chen,
Yunhao Gou,
Runhui Huang,
Zhili Liu,
Daxin Tan,
Jing Xu,
Chunwei Wang,
Yi Zhu,
Yihan Zeng,
Kuo Yang,
Dingdong Wang,
Kun Xiang,
Haoyuan Li,
Haoli Bai,
Jianhua Han,
Xiaohui Li,
Weike Jin,
Nian Xie,
Yu Zhang,
James T. Kwok,
Hengshuang Zhao,
Xiaodan Liang,
Dit-Yan Yeung,
Xiao Chen,
Zhenguo Li
, et al. (6 additional authors not shown)
Abstract:
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or no vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech abilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we find, surprisingly, that omni-modal alignment can further enhance vision-language and speech abilities compared with bi-modal aligned counterparts. Moreover, a lightweight style module is introduced for flexible control of speech style, including emotion and pitch. For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while supporting omni-modal spoken dialogue with vivid emotions.
Submitted 20 March, 2025; v1 submitted 26 September, 2024;
originally announced September 2024.
-
Improving Fast Adversarial Training via Self-Knowledge Guidance
Authors:
Chengze Jiang,
Junkai Wang,
Minjing Dong,
Jie Gui,
Xinli Shi,
Yuan Cao,
Yuan Yan Tang,
James Tin-Yau Kwok
Abstract:
Adversarial training has achieved remarkable advancements in defending against adversarial attacks. Among its variants, fast adversarial training (FAT) is gaining attention for its ability to achieve competitive robustness with fewer computing resources. Existing FAT methods typically employ a uniform strategy that optimizes all training data equally without considering the influence of different examples, which leads to an imbalanced optimization. However, this imbalance remains unexplored in the field of FAT. In this paper, we conduct a comprehensive study of the imbalance issue in FAT and observe an obvious disparity in performance across classes. This disparity can be viewed from the perspective of alignment between clean and robust accuracy. Based on the analysis, we mainly attribute the observed misalignment and disparity to the imbalanced optimization in FAT, which motivates us to optimize different training data adaptively to enhance robustness. Specifically, we take disparity and misalignment into consideration. First, we introduce self-knowledge guided regularization, which assigns differentiated regularization weights to each class based on its training state, alleviating class disparity. Additionally, we propose self-knowledge guided label relaxation, which adjusts label relaxation according to the training accuracy, alleviating the misalignment and improving robustness. By combining these methods, we formulate the Self-Knowledge Guided FAT (SKG-FAT), leveraging naturally generated knowledge during training to enhance adversarial robustness without compromising training efficiency. Extensive experiments on four standard datasets demonstrate that SKG-FAT improves robustness and preserves competitive clean accuracy, outperforming state-of-the-art methods.
Submitted 26 September, 2024;
originally announced September 2024.
-
CFVNet: An End-to-End Cancelable Finger Vein Network for Recognition
Authors:
Yifan Wang,
Jie Gui,
Yuan Yan Tang,
James Tin-Yau Kwok
Abstract:
Finger vein recognition technology has become one of the primary solutions for high-security identification systems. However, it still has information leakage problems, which seriously jeopardize users' privacy and anonymity and cause great security risks. In addition, no existing work considers a fully integrated secure finger vein recognition system. So, unlike previous systems, we integrate preprocessing and template protection into a single deep learning model. We propose an end-to-end cancelable finger vein network (CFVNet), which can be used to design a secure finger vein recognition system. It includes a plug-and-play BWR-ROIAlign unit, which consists of three sub-modules: Localization, Compression and Transformation. The localization module achieves automated localization of stable and unique finger vein ROI. The compression module losslessly removes spatial and channel redundancies. The transformation module uses the proposed BWR method to introduce unlinkability, irreversibility and revocability to the system. BWR-ROIAlign can directly plug into the model to introduce the above features for DCNN-based finger vein recognition systems. We perform extensive experiments on four public datasets to study the performance and cancelable biometric attributes of the CFVNet-based recognition system. The average accuracy, EER and Dsys on the four datasets are 99.82%, 0.01% and 0.025, respectively, achieving competitive performance compared with state-of-the-art methods.
Submitted 23 September, 2024;
originally announced September 2024.
-
Unrevealed Threats: A Comprehensive Study of the Adversarial Robustness of Underwater Image Enhancement Models
Authors:
Siyu Zhai,
Zhibo He,
Xiaofeng Cong,
Junming Hou,
Jie Gui,
Jian Wei You,
Xin Gong,
James Tin-Yau Kwok,
Yuan Yan Tang
Abstract:
Learning-based methods for underwater image enhancement (UWIE) have undergone extensive exploration. However, learning-based models are usually vulnerable to adversarial examples, and UWIE models are no exception. To the best of our knowledge, there is no comprehensive study on the adversarial robustness of UWIE models, which indicates that UWIE models are potentially under the threat of adversarial attacks. In this paper, we propose a general adversarial attack protocol. We make a first attempt to conduct adversarial attacks on five well-designed UWIE models on three common underwater image benchmark datasets. Considering the scattering and absorption of light in the underwater environment, there exists a strong correlation between color correction and underwater image enhancement. On this basis, we also design two effective UWIE-oriented adversarial attack methods, Pixel Attack and Color Shift Attack, targeting different color spaces. The results show that the five models exhibit varying degrees of vulnerability to adversarial attacks and that well-designed small perturbations on degraded images are capable of preventing UWIE models from generating enhanced results. Further, we conduct adversarial training on these models and successfully mitigate the effectiveness of adversarial attacks. In summary, we reveal the adversarial vulnerability of UWIE models and propose a new evaluation dimension for UWIE models.
Submitted 10 September, 2024;
originally announced September 2024.
-
CathAction: A Benchmark for Endovascular Intervention Understanding
Authors:
Baoru Huang,
Tuan Vo,
Chayun Kongtongvattana,
Giulio Dagnino,
Dennis Kundrat,
Wenqiang Chi,
Mohamed Abdelaziz,
Trevor Kwok,
Tudor Jianu,
Tuong Do,
Hieu Le,
Minh Nguyen,
Hoan Nguyen,
Erman Tjiputra,
Quang Tran,
Jianyang Xie,
Yanda Meng,
Binod Bhattarai,
Zhaorui Tan,
Hongbin Liu,
Hong Seng Gan,
Wei Wang,
Xi Yang,
Qiufeng Wang,
Jionglong Su
, et al. (13 additional authors not shown)
Abstract:
Real-time visual feedback from catheterization analysis is crucial for enhancing surgical safety and efficiency during endovascular interventions. However, existing datasets are often limited to specific tasks, small in scale, and lacking the comprehensive annotations necessary for broader endovascular intervention understanding. To tackle these limitations, we introduce CathAction, a large-scale dataset for catheterization understanding. Our CathAction dataset encompasses approximately 500,000 annotated frames for catheterization action understanding and collision detection, and 25,000 ground truth masks for catheter and guidewire segmentation. For each task, we benchmark recent related works in the field. We further discuss the challenges of endovascular interventions compared to traditional computer vision tasks and point out open research questions. We hope that CathAction will facilitate the development of endovascular intervention understanding methods that can be applied to real-world applications. The dataset is available at https://airvlab.github.io/cathaction/.
Submitted 30 August, 2024; v1 submitted 23 August, 2024;
originally announced August 2024.
-
Efficient Pareto Manifold Learning with Low-Rank Structure
Authors:
Weiyu Chen,
James T. Kwok
Abstract:
Multi-task learning, which optimizes performance across multiple tasks, is inherently a multi-objective optimization problem. Various algorithms have been developed to provide discrete trade-off solutions on the Pareto front. Recently, continuous Pareto front approximations using a linear combination of base networks have emerged as a compelling strategy. However, this strategy suffers from scalability issues when the number of tasks is large. To address this issue, we propose a novel approach that integrates a main network with several low-rank matrices to efficiently learn the Pareto manifold. It significantly reduces the number of parameters and facilitates the extraction of shared features. We also introduce orthogonal regularization to further bolster performance. Extensive experimental results demonstrate that the proposed approach outperforms state-of-the-art baselines, especially on datasets with a large number of tasks.
Submitted 30 July, 2024;
originally announced July 2024.
-
Learning Scalable Model Soup on a Single GPU: An Efficient Subspace Training Strategy
Authors:
Tao Li,
Weisen Jiang,
Fanghui Liu,
Xiaolin Huang,
James T. Kwok
Abstract:
Pre-training followed by fine-tuning is widely adopted among practitioners. The performance can be improved by "model soups" (Wortsman et al., 2022) via exploring various hyperparameter configurations. The Learned-Soup, a variant of model soups, significantly improves the performance but suffers from substantial memory and time costs due to the requirements of (i) having to load all fine-tuned models simultaneously, and (ii) a large computational graph encompassing all fine-tuned models. In this paper, we propose Memory Efficient Hyperplane Learned Soup (MEHL-Soup) to tackle this issue by formulating the learned soup as a hyperplane optimization problem and introducing block coordinate gradient descent to learn the mixing coefficients. At each iteration, MEHL-Soup only needs to load a few fine-tuned models and build a computational graph with one combined model. We further extend MEHL-Soup to MEHL-Soup+ in a layer-wise manner. Experimental results on various ViT models and data sets show that MEHL-Soup(+) outperforms Learned-Soup(+) in terms of test accuracy, and also reduces memory usage by more than $13\times$. Moreover, MEHL-Soup(+) can be run on a single GPU and achieves $9\times$ speed up in soup construction compared with the Learned-Soup. The code is released at https://github.com/nblt/MEHL-Soup.
Submitted 23 July, 2024; v1 submitted 4 July, 2024;
originally announced July 2024.
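The "mixing coefficients" mentioned in the abstract above weight the parameters of several fine-tuned models into one combined model. The following is a minimal sketch of that combination step only, assuming each model is a plain dictionary of parameter lists. The function name `make_soup` and the data layout are hypothetical, and the hyperplane formulation and block coordinate gradient descent of MEHL-Soup are not reproduced here.

```python
def make_soup(models, coeffs):
    """Combine fine-tuned models into one by a weighted sum of their parameters.

    models: list of dicts mapping a parameter name to a list of floats
            (hypothetical layout, chosen only for illustration).
    coeffs: one mixing coefficient per model.
    """
    if len(models) != len(coeffs):
        raise ValueError("need exactly one coefficient per model")
    soup = {}
    for name in models[0]:
        size = len(models[0][name])
        # Each entry of the combined parameter is the coefficient-weighted
        # sum of the corresponding entries across all models.
        soup[name] = [
            sum(c * m[name][i] for c, m in zip(coeffs, models))
            for i in range(size)
        ]
    return soup


if __name__ == "__main__":
    finetuned = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
    print(make_soup(finetuned, [0.5, 0.5]))    # uniform soup
    print(make_soup(finetuned, [0.25, 0.75]))  # non-uniform coefficients
```

Equal coefficients give a uniform average of the models, while a learned soup would choose the coefficients by optimization on held-out data.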
-
Communication-Efficient and Privacy-Preserving Decentralized Meta-Learning
Authors:
Hansi Yang,
James T. Kwok
Abstract:
Distributed learning, which does not require gathering training data in a central location, has become increasingly important in the big-data era. In particular, random-walk-based decentralized algorithms are flexible in that they do not need a central server trusted by all clients and do not require all clients to be active in all iterations. However, existing distributed learning algorithms assume that all learning clients share the same task. In this paper, we consider the more difficult meta-learning setting, in which different clients perform different (but related) tasks with limited training data. To reduce communication cost and allow better privacy protection, we propose LoDMeta (Local Decentralized Meta-learning) with the use of local auxiliary optimization parameters and random perturbations on the model parameter. Theoretical results are provided on both convergence and privacy analysis. Empirical results on a number of few-shot learning data sets demonstrate that LoDMeta achieves meta-learning accuracy similar to that of centralized meta-learning algorithms, but does not require gathering data from each client and is able to better protect data privacy for each client.
Submitted 18 June, 2024;
originally announced June 2024.
-
Mixup Augmentation with Multiple Interpolations
Authors:
Lifeng Shen,
Jincheng Yu,
Hansi Yang,
James T. Kwok
Abstract:
Mixup and its variants form a popular class of data augmentation techniques. Using a random sample pair, it generates a new sample by linear interpolation of the inputs and labels. However, generating only a single interpolation may limit its augmentation ability. In this paper, we propose a simple yet effective extension called multi-mix, which generates multiple interpolations from a sample pair. With an ordered sequence of generated samples, multi-mix can better guide the training process than standard mixup. Moreover, theoretically, this can also reduce the stochastic gradient variance. Extensive experiments on a number of synthetic and large-scale data sets demonstrate that multi-mix outperforms various mixup variants and non-mixup-based baselines in terms of generalization, robustness, and calibration.
Submitted 3 June, 2024;
originally announced June 2024.
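The interpolation described in the abstract above can be illustrated in a few lines. This is only a sketch under stated assumptions: the abstract does not say how the interpolation weights are drawn or ordered, so sampling them from a Beta distribution (as in standard mixup) and sorting them in ascending order are assumptions, and the names `mixup` and `multi_mix` are hypothetical.

```python
import random


def mixup(x_i, x_j, y_i, y_j, lam):
    """Standard mixup: one linear interpolation of inputs and (one-hot) labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


def multi_mix(x_i, x_j, y_i, y_j, k=5, alpha=1.0, rng=None):
    """Generate k interpolations from a single sample pair.

    Assumption: weights are drawn from Beta(alpha, alpha) and sorted so the
    generated samples form an ordered sequence between the two endpoints.
    """
    rng = rng or random.Random(0)
    lams = sorted(rng.betavariate(alpha, alpha) for _ in range(k))
    # Each item is (weight, mixed input, mixed label).
    return [(lam,) + mixup(x_i, x_j, y_i, y_j, lam) for lam in lams]


if __name__ == "__main__":
    for lam, x, y in multi_mix([0.0, 2.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], k=3):
        print(round(lam, 3), x, y)
```

With `k=1` this reduces to standard mixup on the pair; larger `k` yields several augmented samples from the same pair.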
-
Direct Alignment of Language Models via Quality-Aware Self-Refinement
Authors:
Runsheng Yu,
Yong Wang,
Xiaoqi Jiao,
Youzhi Zhang,
James T. Kwok
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has been commonly used to align the behaviors of Large Language Models (LLMs) with human preferences. Recently, a popular alternative is Direct Preference Optimization (DPO), which replaces an LLM-based reward model with the policy itself, thus obviating the need for extra memory and training time to learn the reward model. However, DPO does not consider the relative qualities of the positive and negative responses, and can lead to sub-optimal training outcomes. To alleviate this problem, we investigate the use of intrinsic knowledge within the LLM being fine-tuned on the fly to obtain relative qualities and help to refine the loss function. Specifically, we leverage the knowledge of the LLM to design a refinement function to estimate the quality of both the positive and negative responses. We show that the constructed refinement function can help self-refine the loss function under mild assumptions. The refinement function is integrated into DPO and its variant Identity Preference Optimization (IPO). Experiments across various evaluators indicate that they can improve the performance of the fine-tuned models over DPO and IPO.
Submitted 31 May, 2024;
originally announced May 2024.
-
Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment
Authors:
Zhili Liu,
Yunhao Gou,
Kai Chen,
Lanqing Hong,
Jiahui Gao,
Fei Mi,
Yu Zhang,
Zhenguo Li,
Xin Jiang,
Qun Liu,
James T. Kwok
Abstract:
As the capabilities of large language models (LLMs) continue to expand, aligning these models with human values remains a significant challenge. Recent studies show that reasoning abilities contribute significantly to model safety, while integrating Mixture-of-Experts (MoE) architectures can further enhance alignment. In this work, we address a fundamental question: how can reasoning abilities and MoE architectures be effectively incorporated into the self-alignment process in LLMs? We propose Mixture of insighTful Experts (MoTE), a novel framework that synergistically combines reasoning chains and expert mixtures to improve self-alignment. From a data perspective, MoTE employs a structured reasoning chain comprising four key stages: Question Analysis, Answer Guidance, Safe Answer, and Safety Checking. This approach enhances safety through multi-step reasoning and proves effective even for smaller and less powerful LLMs (e.g., 7B models). From an architectural perspective, MoTE adopts a multi-LoRA framework with step-level routing, where each expert is dedicated to a specific reasoning step. This design eliminates the need for balance losses, ensures stable training, and supports adaptive inference lengths. Experimental results demonstrate that MoTE significantly improves model safety, jailbreak resistance, and over-refusal capabilities, achieving performance comparable to OpenAI's state-of-the-art o1 model.
Submitted 1 June, 2025; v1 submitted 1 May, 2024;
originally announced May 2024.
-
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
Authors:
Yunhao Gou,
Kai Chen,
Zhili Liu,
Lanqing Hong,
Hang Xu,
Zhenguo Li,
Dit-Yan Yeung,
James T. Kwok,
Yu Zhang
Abstract:
Multimodal large language models (MLLMs) have shown impressive reasoning abilities. However, they are also more vulnerable to jailbreak attacks than their LLM predecessors. Although MLLMs remain capable of detecting unsafe responses, we observe that the safety mechanisms of the pre-aligned LLMs in MLLMs can be easily bypassed with the introduction of image features. To construct robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protecting approach that exploits the inherent safety awareness of MLLMs, and generates safer responses via adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of pre-aligned LLMs in MLLMs. Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that ECSO enhances model safety significantly (e.g., 37.6% improvement on the MM-SafetyBench (SD+OCR) and 71.3% on VLSafe with LLaVA-1.5-7B), while consistently maintaining utility results on common MLLM benchmarks. Furthermore, we show that ECSO can be used as a data engine to generate supervised-finetuning (SFT) data for MLLM alignment without extra human intervention.
Submitted 15 October, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Task-customized Masked AutoEncoder via Mixture of Cluster-conditional Experts
Authors:
Zhili Liu,
Kai Chen,
Jianhua Han,
Lanqing Hong,
Hang Xu,
Zhenguo Li,
James T. Kwok
Abstract:
Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training. However, when the various downstream tasks have data distributions different from the pre-training data, the semantically irrelevant pre-training information might result in negative transfer, impeding MAE's scalability. To address this issue, we propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE), which can be trained once but provides customized pre-training models for diverse downstream tasks. Different from the mixture of experts (MoE), our MoCE trains each expert only with semantically relevant images by using cluster-conditional gates. Thus, each downstream task can be allocated to its customized model pre-trained with data most similar to the downstream data. Experiments on a collection of 11 downstream tasks show that MoCE outperforms the vanilla MAE by 2.45% on average. It also obtains new state-of-the-art self-supervised learning results on detection and segmentation.
Submitted 7 February, 2024;
originally announced February 2024.
-
KICGPT: Large Language Model with Knowledge in Context for Knowledge Graph Completion
Authors:
Yanbin Wei,
Qiushi Huang,
James T. Kwok,
Yu Zhang
Abstract:
Knowledge Graph Completion (KGC) is crucial for addressing knowledge graph incompleteness and supporting downstream applications. Many models have been proposed for KGC. They can be categorized into two main classes: triple-based and text-based approaches. Triple-based methods struggle with long-tail entities due to limited structural information and imbalanced entity distributions. Text-based methods alleviate this issue but require costly training for language models and specific finetuning for knowledge graphs, which limits their efficiency. To alleviate these limitations, in this paper, we propose KICGPT, a framework that integrates a large language model (LLM) and a triple-based KGC retriever. It alleviates the long-tail problem without incurring additional training overhead. KICGPT uses an in-context learning strategy called Knowledge Prompt, which encodes structural knowledge into demonstrations to guide the LLM. Empirical results on benchmark datasets demonstrate the effectiveness of KICGPT with smaller training overhead and no finetuning.
Submitted 23 February, 2024; v1 submitted 4 February, 2024;
originally announced February 2024.
-
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
Authors:
Yanbin Wei,
Shuai Fu,
Weisen Jiang,
Zejian Zhang,
Zhixiong Zeng,
Qi Wu,
James T. Kwok,
Yu Zhang
Abstract:
Large Language Models (LLMs) are increasingly used for various tasks with graph structures. Though LLMs can process graph information in a textual format, they overlook the rich vision modality, which is an intuitive way for humans to comprehend structural information and conduct general graph reasoning. The potential benefits and capabilities of representing graph structures as visual images (i.e., $\textit{visual graph}$) are still unexplored. To fill the gap, we propose an end-to-end framework, called $\textbf{G}$raph to v$\textbf{I}$sual and $\textbf{T}$extual Integr$\textbf{A}$tion (GITA), which is the first to incorporate visual graphs into general graph reasoning. Besides, we establish the $\textbf{G}$raph-based $\textbf{V}$ision-$\textbf{L}$anguage $\textbf{Q}$uestion $\textbf{A}$nswering (GVLQA) dataset from existing graph data, which is the first vision-language dataset for general graph reasoning purposes. Extensive experiments on the GVLQA dataset and five real-world datasets show that GITA outperforms mainstream LLMs in terms of general graph reasoning capabilities. Moreover, we highlight the effectiveness of the layout augmentation on visual graphs and pretraining on the GVLQA dataset.
Submitted 31 October, 2024; v1 submitted 3 February, 2024;
originally announced February 2024.
-
Compositional Oil Spill Detection Based on Object Detector and Adapted Segment Anything Model from SAR Images
Authors:
Wenhui Wu,
Man Sing Wong,
Xinyu Yu,
Guoqiang Shi,
Coco Yin Tung Kwok,
Kang Zou
Abstract:
Semantic segmentation-based methods have attracted extensive attention in oil spill detection from SAR images. However, the existing approaches require a large number of finely annotated segmentation samples in the training stage. To alleviate this issue, we propose a composite oil spill detection framework, SAM-OIL, comprising an object detector (e.g., YOLOv8), an Adapted Segment Anything Model (SAM), and an Ordered Mask Fusion (OMF) module. SAM-OIL is the first application of the powerful SAM in oil spill detection. Specifically, the SAM-OIL strategy uses YOLOv8 to obtain the categories and bounding boxes of oil spill-related objects, then inputs bounding boxes into the Adapted SAM to retrieve category-agnostic masks, and finally adopts the OMF module to fuse the masks and categories. The Adapted SAM, combining a frozen SAM with a learnable Adapter module, can enhance SAM's ability to segment ambiguous objects. The OMF module, a parameter-free method, can effectively resolve pixel category conflicts within SAM. Experimental results demonstrate that SAM-OIL surpasses existing semantic segmentation-based oil spill detection methods, achieving an mIoU of 69.52%. The results also indicate that both the OMF and Adapter modules can effectively improve the accuracy of SAM-OIL.
Submitted 22 December, 2024; v1 submitted 15 January, 2024;
originally announced January 2024.
-
Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning
Authors:
Yunhao Gou,
Zhili Liu,
Kai Chen,
Lanqing Hong,
Hang Xu,
Aoxue Li,
Dit-Yan Yeung,
James T. Kwok,
Yu Zhang
Abstract:
Instruction tuning of Large Vision-language Models (LVLMs) has revolutionized the development of versatile models with zero-shot generalization across a wide range of downstream vision-language tasks. However, the diversity of training tasks of different sources and formats would lead to inevitable task conflicts, where different tasks conflict for the same set of model parameters, resulting in sub-optimal instruction-following abilities. To address that, we propose the Mixture of Cluster-conditional LoRA Experts (MoCLE), a novel Mixture of Experts (MoE) architecture designed to activate the task-customized model parameters based on the instruction clusters. A separate universal expert is further incorporated to improve generalization capabilities of MoCLE for novel instructions. Extensive experiments on InstructBLIP and LLaVA demonstrate the effectiveness of MoCLE.
Submitted 3 July, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Aggregation Weighting of Federated Learning via Generalization Bound Estimation
Authors:
Mingwei Xu,
Xiaofeng Cao,
Ivor W. Tsang,
James T. Kwok
Abstract:
Federated Learning (FL) typically aggregates client model parameters using a weighting approach determined by sample proportions. However, this naive weighting method may lead to unfairness and degradation in model performance due to statistical heterogeneity and the inclusion of noisy data among clients. Theoretically, distributional robustness analysis has shown that the generalization performance of a learning model with respect to any shifted distribution is bounded. This motivates us to reconsider the weighting approach in federated learning. In this paper, we replace the aforementioned weighting method with a new strategy that considers the generalization bounds of each local model. Specifically, we estimate the upper and lower bounds of the second-order origin moment of the shifted distribution for the current local model, and then use the disagreements between these bounds as the aggregation proportions for weighting in each communication round. Experiments demonstrate that the proposed weighting strategy significantly improves the performance of several representative FL algorithms on benchmark datasets.
Submitted 10 November, 2023;
originally announced November 2023.
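The baseline that the abstract above sets out to replace, aggregation weighted by sample proportions, can be sketched as follows. The function names and the flat-list parameter layout are hypothetical. The bound-based weights proposed in the paper are not reproduced here; the `weights` argument only marks where an alternative weighting would plug in.

```python
def sample_proportion_weights(sample_counts):
    """Baseline weights: each client's share of the total number of samples."""
    total = sum(sample_counts)
    return [n / total for n in sample_counts]


def aggregate(client_params, weights):
    """Weighted average of client parameter vectors.

    client_params: list of equal-length lists of floats, one per client
                   (hypothetical layout, chosen only for illustration).
    weights: one non-negative weight per client; normalized here so that
             any weighting scheme can be passed in.
    """
    total = sum(weights)
    normalized = [w / total for w in weights]
    dim = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(normalized, client_params))
        for i in range(dim)
    ]


if __name__ == "__main__":
    clients = [[1.0, 0.0], [3.0, 4.0]]
    weights = sample_proportion_weights([10, 30])
    print(weights, aggregate(clients, weights))
```

Because `aggregate` normalizes whatever weights it receives, swapping the baseline for a different per-round weighting only changes how `weights` is computed.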
-
ADMarker: A Multi-Modal Federated Learning System for Monitoring Digital Biomarkers of Alzheimer's Disease
Authors:
Xiaomin Ouyang,
Xian Shuai,
Yang Li,
Li Pan,
Xifan Zhang,
Heming Fu,
Sitong Cheng,
Xinyan Wang,
Shihua Cao,
Jiang Xin,
Hazel Mok,
Zhenyu Yan,
Doris Sau Fung Yu,
Timothy Kwok,
Guoliang Xing
Abstract:
Alzheimer's Disease (AD) and related dementia are a growing global health challenge due to the aging population. In this paper, we present ADMarker, the first end-to-end system that integrates multi-modal sensors and new federated learning algorithms for detecting multidimensional AD digital biomarkers in natural living environments. ADMarker features a novel three-stage multi-modal federated learning architecture that can accurately detect digital biomarkers in a privacy-preserving manner. Our approach collectively addresses several major real-world challenges, such as limited data labels, data heterogeneity, and limited computing resources. We built a compact multi-modality hardware system and deployed it in a four-week clinical trial involving 91 elderly participants. The results indicate that ADMarker can accurately detect a comprehensive set of digital biomarkers with up to 93.8% accuracy and identify early AD with an average of 88.9% accuracy. ADMarker offers a new platform that can allow AD clinicians to characterize and track the complex correlation between multidimensional interpretable digital biomarkers, demographic factors of patients, and AD diagnosis in a longitudinal manner.
Submitted 12 April, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
BYOM: Building Your Own Multi-Task Model For Free
Authors:
Weisen Jiang,
Baijiong Lin,
Han Shi,
Yu Zhang,
Zhenguo Li,
James T. Kwok
Abstract:
Recently, various merging methods have been proposed to build a multi-task model from task-specific finetuned models without retraining. However, existing methods suffer from a large performance deterioration compared to using multiple task-specific models. In this paper, we propose to inject task-specific knowledge into the merged model and design two parameter-efficient approaches (BYOM-FFT and BYOM-LoRA) to Build Your Own Multi-task model. BYOM-FFT is for merging fully finetuned models, while BYOM-LoRA is for LoRA-finetuned models. Both methods are data-free and computation-efficient. Extensive experiments on computer vision and natural language processing tasks show that the proposed BYOM methods outperform existing merging methods by a large margin. Moreover, BYOM-FFT is general and can be integrated into existing merging methods to further boost performance.
Submitted 3 February, 2024; v1 submitted 3 October, 2023;
originally announced October 2023.
-
Domain-Guided Conditional Diffusion Model for Unsupervised Domain Adaptation
Authors:
Yulong Zhang,
Shuhao Chen,
Weisen Jiang,
Yu Zhang,
Jiangang Lu,
James T. Kwok
Abstract:
Limited transferability hinders the performance of deep learning models when applied to new application scenarios. Recently, Unsupervised Domain Adaptation (UDA) has achieved significant progress in addressing this issue via learning domain-invariant features. However, the performance of existing UDA methods is constrained by the large domain shift and limited target domain data. To alleviate these issues, we propose DomAin-guided Conditional Diffusion Model (DACDM) to generate high-fidelity and diverse samples for the target domain. In the proposed DACDM, by introducing class information, the labels of generated samples can be controlled, and a domain classifier is further introduced in DACDM to guide the generated samples for the target domain. The generated samples help existing UDA methods transfer from the source domain to the target domain more easily, thus improving the transfer performance. Extensive experiments on various benchmarks demonstrate that DACDM brings a large improvement to the performance of existing UDA methods.
Submitted 23 September, 2023;
originally announced September 2023.
-
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Authors:
Longhui Yu,
Weisen Jiang,
Han Shi,
Jincheng Yu,
Zhengying Liu,
Yu Zhang,
James T. Kwok,
Zhenguo Li,
Adrian Weller,
Weiyang Liu
Abstract:
Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (e.g., LLaMA-2) are still far from satisfactory at solving mathematical problems due to the complex reasoning procedures. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Particularly, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release the MetaMathQA dataset, the MetaMath models with different model sizes, and the training code for public use.
Submitted 3 May, 2024; v1 submitted 21 September, 2023;
originally announced September 2023.
-
Dual-Balancing for Multi-Task Learning
Authors:
Baijiong Lin,
Weisen Jiang,
Feiyang Ye,
Yu Zhang,
Pengguang Chen,
Ying-Cong Chen,
Shu Liu,
James T. Kwok
Abstract:
Multi-task learning (MTL), a learning paradigm to learn multiple related tasks simultaneously, has achieved great success in various fields. However, the task balancing problem remains a significant challenge in MTL, with the disparity in loss/gradient scales often leading to performance compromises. In this paper, we propose a Dual-Balancing Multi-Task Learning (DB-MTL) method to alleviate the task balancing problem from both loss and gradient perspectives. Specifically, DB-MTL ensures loss-scale balancing by performing a logarithm transformation on each task loss, and guarantees gradient-magnitude balancing via normalizing all task gradients to the same magnitude as the maximum gradient norm. Extensive experiments conducted on several benchmark datasets consistently demonstrate the state-of-the-art performance of DB-MTL.
Submitted 29 September, 2023; v1 submitted 23 August, 2023;
originally announced August 2023.
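The two balancing steps named in the abstract above, a logarithm transformation on each task loss and rescaling every task gradient to the maximum gradient norm, can be sketched on plain Python lists. This is an illustration under assumptions rather than the authors' implementation: the function name `db_mtl_update` is hypothetical, summing the rescaled gradients into one update direction is an assumption, and any smoothing of gradients across iterations is omitted.

```python
import math


def db_mtl_update(losses, grads, eps=1e-8):
    """Combine per-task gradients after loss-scale and gradient-magnitude balancing.

    losses: positive task losses L_k.
    grads:  gradients of each L_k w.r.t. the shared parameters (lists of floats).
    """
    # Loss-scale balancing: the gradient of log(L_k) is grad(L_k) / L_k.
    log_grads = [
        [g / (loss + eps) for g in grad] for loss, grad in zip(losses, grads)
    ]
    # Gradient-magnitude balancing: rescale every task gradient so its norm
    # matches the largest norm among the tasks.
    norms = [math.sqrt(sum(g * g for g in grad)) for grad in log_grads]
    target = max(norms)
    scaled = [
        [g * target / (norm + eps) for g in grad]
        for grad, norm in zip(log_grads, norms)
    ]
    # Assumption: the balanced gradients are summed into one update direction.
    combined = [sum(column) for column in zip(*scaled)]
    return combined, scaled


if __name__ == "__main__":
    combined, scaled = db_mtl_update([1.0, 100.0], [[3.0, 4.0], [0.0, 50.0]])
    print(combined, scaled)
```

In the usage example the second task has a much larger loss and raw gradient, yet after both steps the two task gradients have the same norm, so neither task dominates the update.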
-
Forward-Backward Reasoning in Large Language Models for Mathematical Verification
Authors:
Weisen Jiang,
Han Shi,
Longhui Yu,
Zhengying Liu,
Yu Zhang,
Zhenguo Li,
James T. Kwok
Abstract:
Self-Consistency samples diverse reasoning chains with answers and chooses the final answer by majority voting. It is based on forward reasoning and cannot further improve performance by sampling more reasoning chains when saturated. To further boost performance, we introduce backward reasoning to verify candidate answers. Specifically, for mathematical tasks, we mask a number in the question and ask the LLM to answer a backward question created by a simple template, i.e., to predict the masked number when a candidate answer is provided. Instead of using forward or backward reasoning alone, we propose FOBAR to combine FOrward and BAckward Reasoning for verification. Extensive experiments on six standard mathematical data sets and three LLMs show that FOBAR achieves state-of-the-art performance. In particular, FOBAR outperforms Self-Consistency, which uses forward reasoning alone, demonstrating that combining forward and backward reasoning is better. In addition, FOBAR performs better than existing verification methods, showing the effectiveness of the simple template used in backward reasoning and the proposed combination. Extensions to non-mathematical problems are also discussed and validated empirically.
Submitted 4 June, 2024; v1 submitted 15 August, 2023;
originally announced August 2023.
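The voting and verification steps described in the abstract above can be illustrated without calling an LLM by working on already-collected outputs. In this sketch the forward score is the vote share of each candidate answer (Self-Consistency), and the backward score is the share of backward chains that recovered the masked number when that candidate was supplied. Combining the two by a geometric mean is an assumption made for illustration, since the abstract does not specify the combination rule, and all function names are hypothetical.

```python
from collections import Counter


def forward_scores(answers):
    """Vote share of each candidate answer across sampled forward chains."""
    counts = Counter(answers)
    return {answer: count / len(answers) for answer, count in counts.items()}


def backward_scores(hits, eps=1e-8):
    """Share of successful backward verifications per candidate.

    hits[a] = number of backward chains that predicted the masked number
    correctly when candidate answer a was inserted into the question.
    """
    total = sum(hits.values())
    return {a: (h + eps) / (total + eps * len(hits)) for a, h in hits.items()}


def combine(forward, backward):
    """Pick the candidate with the best combined score.

    Assumption: forward and backward scores are merged by a geometric mean.
    """
    scores = {a: (forward[a] * backward.get(a, 0.0)) ** 0.5 for a in forward}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    forward = forward_scores(["18", "18", "20", "18", "20"])
    backward = backward_scores({"18": 1, "20": 7})
    print(max(forward, key=forward.get))  # majority vote alone
    print(combine(forward, backward)[0])  # after backward verification
```

In the usage example the majority vote alone favors one candidate, while the backward evidence is strong enough to select the other, which is the situation where verification adds information beyond sampling more forward chains.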
-
Illumination Controllable Dehazing Network based on Unsupervised Retinex Embedding
Authors:
Jie Gui,
Xiaofeng Cong,
Lei He,
Yuan Yan Tang,
James Tin-Yau Kwok
Abstract:
On the one hand, the dehazing task is an ill-posed problem, which means that no unique solution exists. On the other hand, the dehazing task should take into account the subjective factor, which is to give the user selectable dehazed images rather than a single result. Therefore, this paper proposes a multi-output dehazing network with an illumination-controllable ability, called IC-Dehazing. The proposed IC-Dehazing can change the illumination intensity by adjusting the factor of the illumination controllable module, which is realized based on the interpretable Retinex theory. Moreover, the backbone dehazing network of IC-Dehazing consists of a Transformer with double decoders for high-quality image restoration. Further, the prior-based loss function and unsupervised training strategy enable IC-Dehazing to complete the parameter learning process without the need for paired data. To demonstrate the effectiveness of the proposed IC-Dehazing, quantitative and qualitative experiments are conducted on image dehazing, semantic segmentation, and object detection tasks. Code is available at https://github.com/Xiaofeng-life/ICDehazing.
Submitted 9 June, 2023;
originally announced June 2023.
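As a rough illustration of the Retinex idea mentioned in the abstract above, an image can be modeled as reflectance multiplied by illumination, so that scaling the illumination changes brightness while leaving reflectance untouched. The Python sketch below works on a single RGB pixel with values in [0, 1]. It is not the IC-Dehazing module: the function names, the max-channel illumination estimate, and the multiplicative `factor` are all assumptions made for illustration.

```python
from typing import Tuple

Pixel = Tuple[float, float, float]


def decompose(pixel: Pixel, eps: float = 1e-6) -> Tuple[Pixel, float]:
    """Split a pixel into (reflectance, illumination), with pixel = reflectance * illumination.

    Illumination is estimated as the largest channel value (an assumed, simple prior).
    """
    illumination = max(max(pixel), eps)
    reflectance = tuple(c / illumination for c in pixel)
    return reflectance, illumination


def relight(pixel: Pixel, factor: float) -> Pixel:
    """Scale the estimated illumination by `factor`, clip it to 1.0, and recompose."""
    reflectance, illumination = decompose(pixel)
    new_illumination = min(illumination * factor, 1.0)
    return tuple(r * new_illumination for r in reflectance)
```

Calling `relight` with several values of `factor` yields several outputs from one input, which mirrors the multi-output behavior the abstract describes at a very coarse level.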
-
Effective Structured Prompting by Meta-Learning and Representative Verbalizer
Authors:
Weisen Jiang,
Yu Zhang,
James T. Kwok
Abstract:
Prompt tuning for pre-trained masked language models (MLMs) has shown promising performance in natural language processing tasks with few labeled examples. It tunes a prompt for the downstream task, and a verbalizer is used to bridge the predicted token and label prediction. Due to the limited training data, prompt initialization is crucial for prompt tuning. Recently, MetaPrompting (Hou et al., 2022) uses meta-learning to learn a shared initialization for all task-specific prompts. However, a single initialization is insufficient to obtain good prompts for all tasks and samples when the tasks are complex. Moreover, MetaPrompting requires tuning the whole MLM, causing a heavy burden on computation and memory as the MLM is usually large. To address these issues, we use a prompt pool to extract more task knowledge and construct instance-dependent prompts via attention. We further propose a novel soft verbalizer (RepVerb) which constructs label embeddings from feature embeddings directly. Combining meta-learning the prompt pool and RepVerb, we propose MetaPrompter for effective structured prompting. MetaPrompter is parameter-efficient as only the pool is required to be tuned. Experimental results demonstrate that MetaPrompter performs better than recent state-of-the-art methods and RepVerb outperforms existing soft verbalizers.
Submitted 21 March, 2024; v1 submitted 1 June, 2023;
originally announced June 2023.
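One plausible reading of "constructs label embeddings from feature embeddings directly" is a prototype-style verbalizer: each label's embedding is built from the feature embeddings of its few labeled examples, and a new example is assigned to the most similar label. The Python sketch below assumes per-label averaging and cosine similarity. Both choices, and all names in the code, are assumptions for illustration, since the abstract does not state how RepVerb aggregates features or scores labels.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def label_embeddings(
    features: List[Vector], labels: List[Hashable]
) -> Dict[Hashable, List[float]]:
    """Average the feature embeddings of the examples belonging to each label."""
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = defaultdict(int)
    for vec, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(vec)
        sums[y] = [s + v for s, v in zip(sums[y], vec)]
        counts[y] += 1
    return {y: [s / counts[y] for s in total] for y, total in sums.items()}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(feature: Vector, prototypes: Dict[Hashable, List[float]]) -> Hashable:
    """Return the label whose embedding is most similar to the query feature."""
    return max(prototypes, key=lambda y: cosine(feature, prototypes[y]))
```

A verbalizer of this kind introduces no trainable label parameters of its own, which is consistent with the abstract's statement that only the prompt pool needs to be tuned.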