-
Accelerating Spectral Clustering under Fairness Constraints
Authors:
Francesco Tonin,
Alex Lambert,
Johan A. K. Suykens,
Volkan Cevher
Abstract:
Fairness of decision-making algorithms is an increasingly important issue. In this paper, we focus on spectral clustering with group fairness constraints, where every demographic group is represented in each cluster proportionally as in the general population. We present a new efficient method for fair spectral clustering (Fair SC) by casting the Fair SC problem within the difference of convex functions (DC) framework. To this end, we introduce a novel variable augmentation strategy and employ an alternating direction method of multipliers type of algorithm adapted to DC problems. We show that each associated subproblem can be solved efficiently, resulting in higher computational efficiency compared to prior work, which required a computationally expensive eigendecomposition. Numerical experiments demonstrate the effectiveness of our approach on both synthetic and real-world benchmarks, showing significant speedups in computation time over prior art, especially as the problem size grows. This work thus represents a considerable step forward towards the adoption of fair clustering in real-world applications.
Submitted 9 June, 2025;
originally announced June 2025.
-
Generative Kernel Spectral Clustering
Authors:
David Winant,
Sonny Achten,
Johan A. K. Suykens
Abstract:
Modern clustering approaches often trade interpretability for performance, particularly in deep learning-based methods. We present Generative Kernel Spectral Clustering (GenKSC), a novel model combining kernel spectral clustering with generative modeling to produce both well-defined clusters and interpretable representations. By augmenting weighted variance maximization with reconstruction and clustering losses, our model creates an explorable latent space where cluster characteristics can be visualized through traversals along cluster directions. Results on MNIST and FashionMNIST datasets demonstrate the model's ability to learn meaningful cluster representations.
Submitted 4 February, 2025;
originally announced February 2025.
-
Learning in Feature Spaces via Coupled Covariances: Asymmetric Kernel SVD and Nyström method
Authors:
Qinghua Tao,
Francesco Tonin,
Alex Lambert,
Yingyi Chen,
Panagiotis Patrinos,
Johan A. K. Suykens
Abstract:
In contrast with Mercer kernel-based approaches as used, e.g., in Kernel Principal Component Analysis (KPCA), it was previously shown that Singular Value Decomposition (SVD) inherently relates to asymmetric kernels, and Asymmetric Kernel Singular Value Decomposition (KSVD) has been proposed. However, the existing formulation of KSVD cannot work with infinite-dimensional feature mappings, its variational objective can be unbounded, and it needs further numerical evaluation and exploration towards machine learning. In this work, i) we introduce a new asymmetric learning paradigm based on the coupled covariance eigenproblem (CCE) through covariance operators, allowing infinite-dimensional feature maps. The solution to CCE is ultimately obtained from the SVD of the induced asymmetric kernel matrix, providing links to KSVD. ii) Starting from the integral equations corresponding to a pair of coupled adjoint eigenfunctions, we formalize the asymmetric Nyström method through a finite sample approximation to speed up training. iii) We provide the first empirical evaluations verifying the practical utility and benefits of KSVD and compare with methods resorting to symmetrization or linear SVD across multiple tasks.
Submitted 8 March, 2025; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Learning Analysis of Kernel Ridgeless Regression with Asymmetric Kernel Learning
Authors:
Fan He,
Mingzhen He,
Lei Shi,
Xiaolin Huang,
Johan A. K. Suykens
Abstract:
Ridgeless regression has garnered attention among researchers, particularly in light of the "Benign Overfitting" phenomenon, where models interpolating noisy samples demonstrate robust generalization. However, kernel ridgeless regression does not always perform well due to the lack of flexibility. This paper enhances kernel ridgeless regression with Locally-Adaptive-Bandwidths (LAB) RBF kernels, incorporating kernel learning techniques to improve performance in both experiments and theory. For the first time, we demonstrate that functions learned from LAB RBF kernels belong to an integral space of Reproducing Kernel Hilbert Spaces (RKHSs). Despite the absence of explicit regularization in the proposed model, its optimization is equivalent to solving an $\ell_0$-regularized problem in the integral space of RKHSs, elucidating the origin of its generalization ability. Taking an approximation analysis viewpoint, we introduce an $\ell_q$-norm analysis technique (with $0<q<1$) to derive the learning rate for the proposed model under mild conditions. This result deepens our theoretical understanding, explaining that our algorithm's robust approximation ability arises from the large capacity of the integral space of RKHSs, while its generalization ability is ensured by sparsity, controlled by the number of support vectors. Experimental results on both synthetic and real datasets validate our theoretical conclusions.
Submitted 3 June, 2024;
originally announced June 2024.
-
HeNCler: Node Clustering in Heterophilous Graphs via Learned Asymmetric Similarity
Authors:
Sonny Achten,
Zander Op de Beeck,
Francesco Tonin,
Volkan Cevher,
Johan A. K. Suykens
Abstract:
Clustering nodes in heterophilous graphs is challenging as traditional methods assume that effective clustering is characterized by high intra-cluster and low inter-cluster connectivity. To address this, we introduce HeNCler, a novel approach for Heterophilous Node Clustering. HeNCler learns a similarity graph by optimizing a clustering-specific objective based on weighted kernel singular value decomposition. Our approach enables spectral clustering on an asymmetric similarity graph, providing flexibility for both directed and undirected graphs. By solving the primal problem directly, our method overcomes the computational difficulties of traditional adjacency partitioning-based approaches. Experimental results show that HeNCler significantly improves node clustering performance in heterophilous graph settings, highlighting the advantage of its asymmetric graph-learning framework.
Submitted 24 June, 2025; v1 submitted 27 May, 2024;
originally announced May 2024.
-
Sparsity via Sparse Group $k$-max Regularization
Authors:
Qinghua Tao,
Xiangming Xi,
Jun Xu,
Johan A. K. Suykens
Abstract:
For the linear inverse problem with sparsity constraints, the $l_0$ regularized problem is NP-hard, and existing approaches either utilize greedy algorithms to find almost-optimal solutions or approximate the $l_0$ regularization with its convex counterparts. In this paper, we propose a novel and concise regularization, namely the sparse group $k$-max regularization, which not only simultaneously enhances the group-wise and in-group sparsity, but also imposes no additional restraints on the magnitude of variables in each group, which is especially important for variables at different scales, so that it approximates the $l_0$ norm more closely. We also establish an iterative soft thresholding algorithm with local optimality conditions and complexity analysis provided. Through numerical experiments on both synthetic and real-world datasets, we verify the effectiveness and flexibility of the proposed method.
Submitted 13 February, 2024;
originally announced February 2024.
-
Self-Attention through Kernel-Eigen Pair Sparse Variational Gaussian Processes
Authors:
Yingyi Chen,
Qinghua Tao,
Francesco Tonin,
Johan A. K. Suykens
Abstract:
While the great capability of Transformers significantly boosts prediction accuracy, it could also yield overconfident predictions and require calibrated uncertainty estimation, which can be commonly tackled by Gaussian processes (GPs). Existing works apply GPs with symmetric kernels under variational inference to the attention kernel; however, this omits the fact that attention kernels are in essence asymmetric. Moreover, the complexity of deriving the GP posteriors remains high for large-scale data. In this work, we propose Kernel-Eigen Pair Sparse Variational Gaussian Processes (KEP-SVGP) for building uncertainty-aware self-attention where the asymmetry of attention kernels is tackled by Kernel SVD (KSVD) and a reduced complexity is acquired. Through KEP-SVGP, i) the SVGP pair induced by the two sets of singular vectors from KSVD w.r.t. the attention kernel fully characterizes the asymmetry; ii) using only a small set of adjoint eigenfunctions from KSVD, the derivation of SVGP posteriors can be based on the inversion of a diagonal matrix containing singular values, contributing to a reduction in time complexity; iii) an evidence lower bound is derived so that variational parameters and network weights can be optimized with it. Experiments verify the excellent performance and efficiency of our method on in-distribution, distribution-shift and out-of-distribution benchmarks.
Submitted 28 May, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Can overfitted deep neural networks in adversarial training generalize? -- An approximation viewpoint
Authors:
Zhongjie Shi,
Fanghui Liu,
Yuan Cao,
Johan A. K. Suykens
Abstract:
Adversarial training is a widely used method to improve the robustness of deep neural networks (DNNs) over adversarial perturbations. However, it is empirically observed that adversarial training on over-parameterized networks often suffers from robust overfitting: it can achieve almost zero adversarial training error while the robust generalization performance is not promising. In this paper, we provide a theoretical understanding of the question of whether overfitted DNNs in adversarial training can generalize from an approximation viewpoint. Specifically, our main results are threefold: i) For classification, we prove by construction the existence of infinitely many adversarial training classifiers on over-parameterized DNNs that obtain arbitrarily small adversarial training error (overfitting), while achieving good robust generalization error under certain conditions concerning the data quality, separation, and perturbation level. ii) Linear over-parameterization (meaning that the number of parameters is only slightly larger than the sample size) is enough to ensure such existence if the target function is smooth enough. iii) For regression, our results demonstrate that there also exist infinitely many overfitted DNNs with linear over-parameterization in adversarial training that can achieve almost optimal rates of convergence for the standard generalization error. Overall, our analysis points out that robust overfitting can be avoided but the required model capacity will depend on the smoothness of the target function, while a robust generalization gap is inevitable. We hope our analysis will give a better understanding of the mathematical foundations of robustness in DNNs from an approximation view.
Submitted 24 January, 2024;
originally announced January 2024.
-
Nonlinear functional regression by functional deep neural network with kernel embedding
Authors:
Zhongjie Shi,
Jun Fan,
Linhao Song,
Ding-Xuan Zhou,
Johan A. K. Suykens
Abstract:
Recently, deep learning has been widely applied in functional data analysis (FDA) with notable empirical success. However, the infinite dimensionality of functional data necessitates an effective dimension reduction approach for functional learning tasks, particularly in nonlinear functional regression. In this paper, we introduce a functional deep neural network with an adaptive and discretization-invariant dimension reduction method. Our functional network architecture consists of three parts: first, a kernel embedding step that features an integral transformation with an adaptive smooth kernel; next, a projection step that utilizes eigenfunction bases based on a projection Mercer kernel for the dimension reduction; and finally, a deep ReLU neural network is employed for the prediction. Explicit rates of approximating nonlinear smooth functionals across various input function spaces by our proposed functional network are derived. Additionally, we conduct a generalization analysis for the empirical risk minimization (ERM) algorithm applied to our functional net, by employing a novel two-stage oracle inequality and the established functional approximation results. Ultimately, we conduct numerical experiments on both simulated and real datasets to demonstrate the effectiveness and benefits of our functional net.
Submitted 12 May, 2025; v1 submitted 5 January, 2024;
originally announced January 2024.
-
Enhancing Kernel Flexibility via Learning Asymmetric Locally-Adaptive Kernels
Authors:
Fan He,
Mingzhen He,
Lei Shi,
Xiaolin Huang,
Johan A. K. Suykens
Abstract:
The lack of sufficient flexibility is the key bottleneck of kernel-based learning that relies on manually designed, pre-given, and non-trainable kernels. To enhance kernel flexibility, this paper introduces the concept of Locally-Adaptive-Bandwidths (LAB) as trainable parameters to enhance the Radial Basis Function (RBF) kernel, giving rise to the LAB RBF kernel. The parameters in LAB RBF kernels are data-dependent, and their number can increase with the dataset, allowing for better adaptation to diverse data patterns and enhancing the flexibility of the learned function. This newfound flexibility also brings challenges, particularly with regard to asymmetry and the need for an efficient learning algorithm. To address these challenges, this paper for the first time establishes an asymmetric kernel ridge regression framework and introduces an iterative kernel learning algorithm. This novel approach not only reduces the demand for extensive support data but also significantly improves generalization by training bandwidths on the available training data. Experimental results on real datasets underscore the remarkable performance of the proposed algorithm, showcasing its superior capability in handling large-scale datasets compared to Nyström approximation-based algorithms. Moreover, it demonstrates a significant improvement in regression accuracy over existing kernel-based learning methods and even surpasses residual neural networks.
Submitted 8 October, 2023;
originally announced October 2023.
-
Low-Rank Multitask Learning based on Tensorized SVMs and LSSVMs
Authors:
Jiani Liu,
Qinghua Tao,
Ce Zhu,
Yipeng Liu,
Xiaolin Huang,
Johan A. K. Suykens
Abstract:
Multitask learning (MTL) leverages task-relatedness to enhance performance. With the emergence of multimodal data, tasks can now be referenced by multiple indices. In this paper, we employ high-order tensors, with each mode corresponding to a task index, to naturally represent tasks referenced by multiple indices and preserve their structural relations. Based on this representation, we propose a general framework of low-rank MTL methods with tensorized support vector machines (SVMs) and least squares support vector machines (LSSVMs), where the CP factorization is deployed over the coefficient tensor. Our approach allows the task relation to be modeled through a linear combination of shared factors weighted by task-specific factors and is generalized to both classification and regression problems. Through the alternating optimization scheme and the Lagrangian function, each subproblem is transformed into a convex problem, formulated as a quadratic program or a linear system in the dual form. In contrast to previous MTL frameworks, our decision function in the dual induces a weighted kernel function with a task-coupling term characterized by the similarities of the task-specific factors, better revealing the explicit relations across tasks in MTL. Experimental results validate the effectiveness and superiority of our proposed methods compared to existing state-of-the-art approaches in MTL. The code of implementation will be available at https://github.com/liujiani0216/TSVM-MTL.
Submitted 30 August, 2023;
originally announced August 2023.
-
A Dual Formulation for Probabilistic Principal Component Analysis
Authors:
Henri De Plaen,
Johan A. K. Suykens
Abstract:
In this paper, we characterize Probabilistic Principal Component Analysis in Hilbert spaces and demonstrate how the optimal solution admits a representation in dual space. This allows us to develop a generative framework for kernel methods. Furthermore, we show how it encompasses Kernel Principal Component Analysis and illustrate how it works on a toy and a real dataset.
Submitted 19 July, 2023;
originally announced July 2023.
-
Unbalanced Optimal Transport: A Unified Framework for Object Detection
Authors:
Henri De Plaen,
Pierre-François De Plaen,
Johan A. K. Suykens,
Marc Proesmans,
Tinne Tuytelaars,
Luc Van Gool
Abstract:
During training, supervised object detection tries to correctly match the predicted bounding boxes and associated classification scores to the ground truth. This is essential to determine which predictions are to be pushed towards which solutions, or to be discarded. Popular matching strategies include matching to the closest ground truth box (mostly used in combination with anchors), or matching via the Hungarian algorithm (mostly used in anchor-free methods). Each of these strategies comes with its own properties, underlying losses, and heuristics. We show how Unbalanced Optimal Transport unifies these different approaches and opens a whole continuum of methods in between. This allows for a finer selection of the desired properties. Experimentally, we show that training an object detection model with Unbalanced Optimal Transport is able to reach the state-of-the-art both in terms of Average Precision and Average Recall as well as to provide a faster initial convergence. The approach is well suited for GPU implementation, which proves to be an advantage for large-scale models.
Submitted 5 July, 2023;
originally announced July 2023.
-
Nonlinear SVD with Asymmetric Kernels: feature learning and asymmetric Nyström method
Authors:
Qinghua Tao,
Francesco Tonin,
Panagiotis Patrinos,
Johan A. K. Suykens
Abstract:
Asymmetric data naturally exist in real life, such as directed graphs. Different from the common kernel methods requiring Mercer kernels, this paper tackles the asymmetric kernel-based learning problem. We describe a nonlinear extension of the matrix Singular Value Decomposition through asymmetric kernels, namely KSVD. First, we construct two nonlinear feature mappings w.r.t. rows and columns of the given data matrix. The proposed optimization problem maximizes the variance of each mapping projected onto the subspace spanned by the other, subject to a mutual orthogonality constraint. Through Lagrangian duality, we show that it can be solved by the left and right singular vectors in the feature space induced by the asymmetric kernel. Moreover, we start from the integral equations with a pair of adjoint eigenfunctions corresponding to the singular vectors of an asymmetric kernel, and extend the Nyström method to asymmetric cases through the finite sample approximation, which can be applied to speed up the training in KSVD. Experiments show that asymmetric KSVD learns features outperforming Mercer-kernel based methods that resort to symmetrization, and also verify the effectiveness of the asymmetric Nyström method.
Submitted 12 June, 2023;
originally announced June 2023.
-
Combining Primal and Dual Representations in Deep Restricted Kernel Machines Classifiers
Authors:
Francesco Tonin,
Panagiotis Patrinos,
Johan A. K. Suykens
Abstract:
In the context of deep learning with kernel machines, the deep Restricted Kernel Machine (DRKM) framework allows multiple levels of kernel PCA (KPCA) and Least-Squares Support Vector Machines (LSSVM) to be combined into a deep architecture using visible and hidden units. We propose a new method for DRKM classification coupling the objectives of KPCA and classification levels, with the hidden feature matrix lying on the Stiefel manifold. The classification level can be formulated as an LSSVM or as an MLP feature map, combining depth in terms of levels and layers. The classification level is expressed in its primal formulation, as the deep KPCA levels, in their dual formulation, can embed the most informative components of the data in a much lower dimensional space. The dual setting is independent of the dimension of the inputs and the primal setting is parametric, which makes the proposed method computationally efficient for both high-dimensional inputs and large datasets. In the experiments, we show that our developed algorithm can effectively learn from small datasets, while using less memory than the convolutional neural network (CNN) with high-dimensional data, and that models with multiple KPCA levels can outperform models with a single level. On the tested larger-scale datasets, DRKM is more energy efficient than CNN while maintaining comparable performance.
Submitted 29 August, 2023; v1 submitted 12 June, 2023;
originally announced June 2023.
-
Extending Kernel PCA through Dualization: Sparsity, Robustness and Fast Algorithms
Authors:
Francesco Tonin,
Alex Lambert,
Panagiotis Patrinos,
Johan A. K. Suykens
Abstract:
The goal of this paper is to revisit Kernel Principal Component Analysis (KPCA) through dualization of a difference of convex functions. This allows KPCA to be naturally extended to multiple objective functions and leads to efficient gradient-based algorithms avoiding the expensive SVD of the Gram matrix. In particular, we consider objective functions that can be written as Moreau envelopes, demonstrating how to promote robustness and sparsity within the same framework. The proposed method is evaluated on synthetic and real-world benchmarks, showing significant speedup in KPCA training time as well as highlighting the benefits in terms of robustness and sparsity.
Submitted 9 June, 2023;
originally announced June 2023.
-
Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation
Authors:
Yingyi Chen,
Qinghua Tao,
Francesco Tonin,
Johan A. K. Suykens
Abstract:
Recently, a new line of work has emerged to understand and improve self-attention in Transformers by treating it as a kernel machine. However, existing works apply the methods for symmetric kernels to the asymmetric self-attention, resulting in a nontrivial gap between the analytical understanding and numerical implementation. In this paper, we provide a new perspective to represent and optimize self-attention through asymmetric Kernel Singular Value Decomposition (KSVD), which is also motivated by the low-rank property of self-attention normally observed in deep layers. Through asymmetric KSVD, $i$) a primal-dual representation of self-attention is formulated, where the optimization objective is cast to maximize the projection variances in the attention outputs; $ii$) a novel attention mechanism, i.e., Primal-Attention, is proposed via the primal representation of KSVD, avoiding explicit computation of the kernel matrix in the dual; $iii$) with KKT conditions, we prove that the stationary solution to the KSVD optimization in Primal-Attention yields a zero-value objective. In this manner, KSVD optimization can be implemented by simply minimizing a regularization loss, so that the low-rank property is promoted without extra decomposition. Numerical experiments show state-of-the-art performance of our Primal-Attention with improved efficiency. Moreover, we demonstrate that the deployed KSVD optimization regularizes Primal-Attention with a sharper singular value decay than that of the canonical self-attention, further verifying the great potential of our method. To the best of our knowledge, this is the first work that provides a primal-dual representation for the asymmetric kernel in self-attention and successfully applies it to modeling and optimization.
Submitted 5 December, 2023; v1 submitted 31 May, 2023;
originally announced May 2023.
-
Duality in Multi-View Restricted Kernel Machines
Authors:
Sonny Achten,
Arun Pandey,
Hannes De Meulemeester,
Bart De Moor,
Johan A. K. Suykens
Abstract:
We propose a unifying setting that combines existing restricted kernel machine methods into a single primal-dual multi-view framework for kernel principal component analysis in both supervised and unsupervised settings. We derive the primal and dual representations of the framework and relate different training and inference algorithms from a theoretical perspective. We show how to achieve full equivalence in primal and dual formulations by rescaling primal variables. Finally, we experimentally validate the equivalence and provide insight into the relationships between different methods on a number of time series data sets by recursively forecasting unseen test data and visualizing the learned features.
Submitted 6 July, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
-
Tensorized LSSVMs for Multitask Regression
Authors:
Jiani Liu,
Qinghua Tao,
Ce Zhu,
Yipeng Liu,
Johan A. K. Suykens
Abstract:
Multitask learning (MTL) can utilize the relatedness between multiple tasks for performance improvement. The advent of multimodal data allows tasks to be referenced by multiple indices. High-order tensors are capable of providing efficient representations for such tasks, while preserving structural task-relations. In this paper, a new MTL method is proposed by leveraging low-rank tensor analysis and constructing tensorized Least Squares Support Vector Machines, namely the tLSSVM-MTL, where multilinear modelling and its nonlinear extensions can be flexibly exerted. We employ a high-order tensor for all the weights with each mode relating to an index and factorize it with CP decomposition, assigning a shared factor for all tasks and retaining task-specific latent factors along each index. Then an alternating algorithm is derived for the nonconvex optimization, where each resulting subproblem is solved by a linear system. Experimental results demonstrate promising performances of our tLSSVM-MTL.
Submitted 4 March, 2023;
originally announced March 2023.
-
Deep Kernel Principal Component Analysis for Multi-level Feature Learning
Authors:
Francesco Tonin,
Qinghua Tao,
Panagiotis Patrinos,
Johan A. K. Suykens
Abstract:
Principal Component Analysis (PCA) and its nonlinear extension Kernel PCA (KPCA) are widely used across science and industry for data analysis and dimensionality reduction. Modern deep learning tools have achieved great empirical success, but a framework for deep principal component analysis is still lacking. Here we develop a deep kernel PCA methodology (DKPCA) to extract multiple levels of the most informative components of the data. Our scheme can effectively identify new hierarchical variables, called deep principal components, capturing the main characteristics of high-dimensional data through a simple and interpretable numerical optimization. We couple the principal components of multiple KPCA levels, theoretically showing that DKPCA creates both forward and backward dependency across levels, which has not been explored in kernel methods and yet is crucial to extract more informative features. Various experimental evaluations on multiple data types show that DKPCA finds more efficient and disentangled representations with higher explained variance in fewer principal components, compared to the shallow KPCA. We demonstrate that our method allows for effective hierarchical data exploration, with the ability to separate the key generative factors of the input data both for large datasets and when few training samples are available. Overall, DKPCA can facilitate the extraction of useful patterns from high-dimensional data by learning more informative features organized in different levels, giving diversified aspects to explore the variation factors in the data, while maintaining a simple mathematical formulation.
Submitted 22 February, 2023;
originally announced February 2023.
-
Unsupervised Neighborhood Propagation Kernel Layers for Semi-supervised Node Classification
Authors:
Sonny Achten,
Francesco Tonin,
Panagiotis Patrinos,
Johan A. K. Suykens
Abstract:
We present a deep Graph Convolutional Kernel Machine (GCKM) for semi-supervised node classification in graphs. The method is built of two main types of blocks: (i) We introduce unsupervised kernel machine layers propagating the node features in a one-hop neighborhood, using implicit node feature mappings. (ii) We specify a semi-supervised classification kernel machine through the lens of the Fenchel-Young inequality. We derive an effective initialization scheme and efficient end-to-end training algorithm in the dual variables for the full architecture. The main idea underlying GCKM is that, because of the unsupervised core, the final model can achieve higher performance in semi-supervised node classification when few labels are available for training. Experimental results demonstrate the effectiveness of the proposed framework.
Submitted 15 December, 2023; v1 submitted 31 January, 2023;
originally announced January 2023.
-
Multi-view Kernel PCA for Time series Forecasting
Authors:
Arun Pandey,
Hannes De Meulemeester,
Bart De Moor,
Johan A. K. Suykens
Abstract:
In this paper, we propose a kernel principal component analysis model for multi-variate time series forecasting, where the training and prediction schemes are derived from the multi-view formulation of Restricted Kernel Machines. The training problem is simply an eigenvalue decomposition of the summation of two kernel matrices corresponding to the views of the input and output data. When a linear kernel is used for the output view, it is shown that the forecasting equation takes the form of kernel ridge regression. When that kernel is non-linear, a pre-image problem has to be solved to forecast a point in the input space. We evaluate the model on several standard time series datasets, perform ablation studies, benchmark with closely related models and discuss its results.
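As a rough illustration of the training step described above (an eigendecomposition of the sum of an input-view and an output-view kernel matrix), here is a minimal NumPy sketch. The function names, the Gaussian kernel on the inputs, the linear kernel on the outputs, and the toy data are all assumptions made for illustration; kernel centering, scaling, and the forecasting equation from the paper are not reproduced.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_two_view_kpca(X, Y, n_components=2, sigma=1.0):
    """Eigendecompose K_x + K_y and keep the leading components (hypothetical helper)."""
    K_x = rbf_kernel(X, X, sigma)                  # input view, nonlinear kernel
    K_y = Y @ Y.T                                  # output view, linear kernel
    eigvals, eigvecs = np.linalg.eigh(K_x + K_y)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]

# Toy data standing in for lagged inputs and next-step targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = rng.normal(size=(50, 1))
lam, H = fit_two_view_kpca(X, Y, n_components=2)
```

Here the columns of `H` play the role of shared latent components across the two views; `np.linalg.eigh` is used because the summed kernel matrix is symmetric.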
Submitted 23 January, 2023;
originally announced January 2023.
-
Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer
Authors:
Yingyi Chen,
Xi Shen,
Yahui Liu,
Qinghua Tao,
Johan A. K. Suykens
Abstract:
The success of Vision Transformer (ViT) in various computer vision tasks has promoted the ever-increasing prevalence of this convolution-free network. The fact that ViT works on image patches makes it potentially relevant to the problem of jigsaw puzzle solving, which is a classical self-supervised task aiming at reordering shuffled sequential image patches back to their natural form. Despite its simplicity, solving jigsaw puzzle has been demonstrated to be helpful for diverse tasks using Convolutional Neural Networks (CNNs), such as self-supervised feature representation learning, domain generalization, and fine-grained classification.
In this paper, we explore solving jigsaw puzzles as a self-supervised auxiliary loss in ViT for image classification, named Jigsaw-ViT. We show two modifications that can make Jigsaw-ViT superior to standard ViT: discarding positional embeddings and masking patches randomly. Although these modifications are simple, we find that Jigsaw-ViT improves both generalization and robustness over the standard ViT, two properties that usually trade off against each other. Experimentally, we show that adding the jigsaw puzzle branch provides better generalization than ViT on large-scale image classification on ImageNet. Moreover, the auxiliary task also improves robustness to noisy labels on Animal-10N, Food-101N, and Clothing1M, as well as to adversarial examples. Our implementation is available at https://yingyichen-cyy.github.io/Jigsaw-ViT/.
Submitted 5 January, 2023; v1 submitted 25 July, 2022;
originally announced July 2022.
-
Tensor-based Multi-view Spectral Clustering via Shared Latent Space
Authors:
Qinghua Tao,
Francesco Tonin,
Panagiotis Patrinos,
Johan A. K. Suykens
Abstract:
Multi-view Spectral Clustering (MvSC) attracts increasing attention due to diverse data sources. However, most existing works are prohibited in out-of-sample predictions and overlook model interpretability and exploration of clustering results. In this paper, a new method for MvSC is proposed via a shared latent space from the Restricted Kernel Machine framework. Through the lens of conjugate feature duality, we cast the weighted kernel principal component analysis problem for MvSC and develop a modified weighted conjugate feature duality to formulate dual variables. In our method, the dual variables, playing the role of hidden features, are shared by all views to construct a common latent space, coupling the views by learning projections from view-specific spaces. Such single latent space promotes well-separated clusters and provides straightforward data exploration, facilitating visualization and interpretation. Our method requires only a single eigendecomposition, whose dimension is independent of the number of views. To boost higher-order correlations, tensor-based modelling is introduced without increasing computational complexity. Our method can be flexibly applied with out-of-sample extensions, enabling greatly improved efficiency for large-scale data with fixed-size kernel schemes. Numerical experiments verify that our method is effective regarding accuracy, efficiency, and interpretability, showing a sharp eigenvalue decay and distinct latent variable distributions.
Submitted 23 July, 2022;
originally announced July 2022.
-
Compressing Features for Learning with Noisy Labels
Authors:
Yingyi Chen,
Shell Xu Hu,
Xi Shen,
Chunrong Ai,
Johan A. K. Suykens
Abstract:
Supervised learning can be viewed as distilling relevant information from input data into feature representations. This process becomes difficult when supervision is noisy as the distilled information might not be relevant. In fact, recent research shows that networks can easily overfit all labels including those that are corrupted, and hence can hardly generalize to clean datasets. In this paper, we focus on the problem of learning with noisy labels and introduce compression inductive bias to network architectures to alleviate this over-fitting problem. More precisely, we revisit one classical regularization named Dropout and its variant Nested Dropout. Dropout can serve as a compression constraint for its feature dropping mechanism, while Nested Dropout further learns ordered feature representations w.r.t. feature importance. Moreover, the trained models with compression regularization are further combined with Co-teaching for performance boost.
Theoretically, we conduct bias-variance decomposition of the objective function under compression regularization. We analyze it for both single model and Co-teaching. This decomposition provides three insights: (i) it shows that over-fitting is indeed an issue for learning with noisy labels; (ii) through an information bottleneck formulation, it explains why the proposed feature compression helps in combating label noise; (iii) it gives explanations on the performance boost brought by incorporating compression regularization into Co-teaching. Experiments show that our simple approach can have comparable or even better performance than the state-of-the-art methods on benchmarks with real-world label noise including Clothing1M and ANIMAL-10N. Our implementation is available at https://yingyichen-cyy.github.io/CompressFeatNoisyLabels/.
Submitted 27 June, 2022;
originally announced June 2022.
-
Piecewise Linear Neural Networks and Deep Learning
Authors:
Qinghua Tao,
Li Li,
Xiaolin Huang,
Xiangming Xi,
Shuning Wang,
Johan A. K. Suykens
Abstract:
As a powerful modelling method, PieceWise Linear Neural Networks (PWLNNs) have proven successful in various fields, most recently in deep learning. To apply PWLNN methods, both the representation and the learning have long been studied. In 1977, the canonical representation pioneered the works of shallow PWLNNs learned by incremental designs, but the applications to large-scale data were prohibited. In 2010, the Rectified Linear Unit (ReLU) advocated the prevalence of PWLNNs in deep learning. Ever since, PWLNNs have been successfully applied to extensive tasks and achieved advantageous performances. In this Primer, we systematically introduce the methodology of PWLNNs by grouping the works into shallow and deep networks. Firstly, different PWLNN representation models are constructed with elaborated examples. With PWLNNs, the evolution of learning algorithms for data is presented and fundamental theoretical analysis follows up for in-depth understandings. Then, representative applications are introduced together with discussions and outlooks.
Submitted 18 June, 2022;
originally announced June 2022.
-
Learning with Asymmetric Kernels: Least Squares and Feature Interpretation
Authors:
Mingzhen He,
Fan He,
Lei Shi,
Xiaolin Huang,
Johan A. K. Suykens
Abstract:
Asymmetric kernels naturally exist in real life, e.g., for conditional probability and directed graphs. However, most of the existing kernel-based learning methods require kernels to be symmetric, which prevents the use of asymmetric kernels. This paper addresses the asymmetric kernel-based learning in the framework of the least squares support vector machine named AsK-LS, resulting in the first classification method that can utilize asymmetric kernels directly. We will show that AsK-LS can learn with asymmetric features, namely source and target features, while the kernel trick remains applicable, i.e., the source and target features exist but are not necessarily known. Besides, the computational burden of AsK-LS is as cheap as dealing with symmetric kernels. Experimental results on the Corel database, directed graphs, and the UCI database will show that in the case asymmetric information is crucial, the proposed AsK-LS can learn with asymmetric kernels and performs much better than the existing kernel methods that have to do symmetrization to accommodate asymmetric kernels.
Submitted 2 February, 2022;
originally announced February 2022.
-
Tensor Network Kalman Filtering for Large-Scale LS-SVMs
Authors:
Maximilian Lucassen,
Johan A. K. Suykens,
Kim Batselier
Abstract:
Least squares support vector machines are a commonly used supervised learning method for nonlinear regression and classification. They can be implemented in either their primal or dual form. The latter requires solving a linear system, which can be advantageous as an explicit mapping of the data to a possibly infinite-dimensional feature space is avoided. However, for large-scale applications, current low-rank approximation methods can perform inadequately. For example, current methods are probabilistic due to their sampling procedures, and/or suffer from a poor trade-off between the ranks and approximation power. In this paper, a recursive Bayesian filtering framework based on tensor networks and the Kalman filter is presented to alleviate the demanding memory and computational complexities associated with solving large-scale dual problems. The proposed method is iterative, does not require explicit storage of the kernel matrix, and allows the formulation of early stopping conditions. Additionally, the framework yields confidence estimates of obtained models, unlike alternative methods. The performance is tested on two regression and three classification experiments, and compared to the Nyström and fixed size LS-SVM methods. Results show that our method can achieve high performance and is particularly useful when alternative methods are computationally infeasible due to a slowly decaying kernel matrix spectrum.
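For context on the dual linear system mentioned above, the following is a minimal sketch of the standard dense LS-SVM regression dual, solved directly with NumPy. This is the baseline problem whose cost motivates the paper, not the tensor network Kalman filter itself; the helper names, the Gaussian kernel, and the hyperparameters `gamma` and `sigma` are assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma=100.0, sigma=1.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y] for the bias and dual variables."""
    n = X.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]

def lssvm_predict(X_train, alpha, b, X_new, sigma=1.0):
    """Evaluate the dual model: sum_i alpha_i k(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Toy 1-D regression problem.
gamma = 100.0
X = np.linspace(-3.0, 3.0, 40)[:, None]
y = np.sin(X[:, 0])
b, alpha = lssvm_fit(X, y, gamma=gamma)
y_hat = lssvm_predict(X, alpha, b, X)
```

The dense solve stores the full n-by-n kernel matrix and costs cubic time in n, which is the memory and compute burden that low-rank and iterative approaches aim to avoid.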
Submitted 26 October, 2021;
originally announced October 2021.
-
On the Double Descent of Random Features Models Trained with SGD
Authors:
Fanghui Liu,
Johan A. K. Suykens,
Volkan Cevher
Abstract:
We study generalization properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD) in under-/over-parameterized regime. In this work, we derive precise non-asymptotic error bounds of RF regression under both constant and polynomial-decay step-size SGD setting, and observe the double descent phenomenon both theoretically and empirically. Our analysis shows how to cope with multiple randomness sources of initialization, label noise, and data sampling (as well as stochastic gradients) with no closed-form solution, and also goes beyond the commonly-used Gaussian/spherical data assumption. Our theoretical results demonstrate that, with SGD training, RF regression still generalizes well for interpolation learning, and is able to characterize the double descent behavior by the unimodality of variance and monotonic decrease of bias. Besides, we also prove that the constant step-size SGD setting incurs no loss in convergence rate when compared to the exact minimum-norm interpolator, as a theoretical justification of using SGD in practice.
Submitted 16 October, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Latent Space Exploration Using Generative Kernel PCA
Authors:
David Winant,
Joachim Schreurs,
Johan A. K. Suykens
Abstract:
Kernel PCA is a powerful feature extractor which recently has seen a reformulation in the context of Restricted Kernel Machines (RKMs). These RKMs allow for a representation of kernel PCA in terms of hidden and visible units similar to Restricted Boltzmann Machines. This connection has led to insights on how to use kernel PCA in a generative procedure, called generative kernel PCA. In this paper, the use of generative kernel PCA for exploring latent spaces of datasets is investigated. New points can be generated by gradually moving in the latent space, which allows for an interpretation of the components. Firstly, examples of this feature space exploration on three datasets are shown with one of them leading to an interpretable representation of ECG signals. Afterwards, the use of the tool in combination with novelty detection is shown, where the latent space around novel patterns in the data is explored. This helps in the interpretation of why certain points are considered as novel.
Submitted 28 May, 2021;
originally announced May 2021.
-
Towards Deterministic Diverse Subset Sampling
Authors:
Joachim Schreurs,
Michaël Fanuel,
Johan A. K. Suykens
Abstract:
Determinantal point processes (DPPs) are well known models for diverse subset selection problems, including recommendation tasks, document summarization and image search. In this paper, we discuss a greedy deterministic adaptation of k-DPP. Deterministic algorithms are interesting for many applications, as they provide interpretability to the user by having no failure probability and always returning the same results. First, the ability of the method to yield low-rank approximations of kernel matrices is evaluated by comparing the accuracy of the Nyström approximation on multiple datasets. Afterwards, we demonstrate the usefulness of the model on an image search task.
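To make the idea of deterministic diverse selection followed by a Nyström approximation concrete, here is a small sketch. It uses a generic greedy rule (at each step, pick the point that most increases the determinant of the selected kernel submatrix, which is equivalent to diagonal-pivoted Cholesky); this is a common heuristic and is assumed here for illustration, so it may differ from the algorithm in the paper. All function names and the toy data are hypothetical.

```python
import numpy as np

def greedy_diverse_subset(K, k):
    """Deterministically pick k indices, each maximizing the determinant gain of K[S, S]."""
    n = K.shape[0]
    residual = np.diag(K).astype(float).copy()   # conditional variance of each point given S
    L = np.zeros((n, k))                         # partial Cholesky factor built column by column
    selected = []
    for j in range(k):
        i = int(np.argmax(residual))             # largest remaining conditional variance
        selected.append(i)
        L[:, j] = (K[:, i] - L[:, :j] @ L[i, :j]) / np.sqrt(residual[i])
        residual = residual - L[:, j] ** 2
        residual[selected] = -np.inf             # never pick the same point twice
    return selected

def nystrom(K, idx):
    """Nystrom approximation K[:, S] K[S, S]^+ K[S, :] from landmark indices idx."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T

# Toy Gaussian kernel matrix on random 2-D points.
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 2))
sq = np.sum(P**2, axis=1)[:, None] + np.sum(P**2, axis=1)[None, :] - 2.0 * P @ P.T
K = np.exp(-np.maximum(sq, 0.0) / 2.0)

S5 = greedy_diverse_subset(K, 5)
S15 = greedy_diverse_subset(K, 15)
err5 = np.linalg.norm(K - nystrom(K, S5))
err15 = np.linalg.norm(K - nystrom(K, S15))
```

Because the rule involves no randomness, repeated runs return the same landmarks, and the selections are nested, so adding landmarks cannot increase the Nyström error in this setup.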
Submitted 28 May, 2021;
originally announced May 2021.
-
Boosting Co-teaching with Compression Regularization for Label Noise
Authors:
Yingyi Chen,
Xi Shen,
Shell Xu Hu,
Johan A. K. Suykens
Abstract:
In this paper, we study the problem of learning image classification models in the presence of label noise. We revisit a simple compression regularization named Nested Dropout. We find that Nested Dropout, though originally proposed to perform fast information retrieval and adaptive data compression, can properly regularize a neural network to combat label noise. Moreover, owing to its simplicity, it can be easily combined with Co-teaching to further boost the performance.
Our final model remains simple yet effective: it achieves comparable or even better performance than the state-of-the-art approaches on two real-world datasets with label noise, Clothing1M and ANIMAL-10N. On Clothing1M, our approach obtains 74.9% accuracy, which is slightly better than that of DivideMix. On ANIMAL-10N, we achieve 84.1% accuracy, while the best public result by PLC is 83.4%. We hope that our simple approach can serve as a strong baseline for learning with label noise. Our implementation is available at https://github.com/yingyichen-cyy/Nested-Co-teaching.
Submitted 28 April, 2021;
originally announced April 2021.
-
Leverage Score Sampling for Complete Mode Coverage in Generative Adversarial Networks
Authors:
Joachim Schreurs,
Hannes De Meulemeester,
Michaël Fanuel,
Bart De Moor,
Johan A. K. Suykens
Abstract:
Commonly, machine learning models minimize an empirical expectation. As a result, the trained models typically perform well for the majority of the data but the performance may deteriorate in less dense regions of the dataset. This issue also arises in generative modeling. A generative model may overlook underrepresented modes that are less frequent in the empirical data distribution. This problem is known as complete mode coverage. We propose a sampling procedure based on ridge leverage scores which significantly improves mode coverage when compared to standard methods and can easily be combined with any GAN. Ridge leverage scores are computed by using an explicit feature map, associated with the next-to-last layer of a GAN discriminator or of a pre-trained network, or by using an implicit feature map corresponding to a Gaussian kernel. Multiple evaluations against recent approaches of complete mode coverage show a clear improvement when using the proposed sampling strategy.
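As an illustration of the implicit-feature-map case described above, the sketch below computes ridge leverage scores from a Gaussian kernel matrix using the standard definition, the diagonal of K (K + n·reg·I)^{-1}, and samples points in proportion to them. The regularization name `reg`, the toy two-mode dataset, and the sampling step are assumptions for illustration and are not taken from the paper's GAN training procedure.

```python
import numpy as np

def ridge_leverage_scores(K, reg):
    """Diagonal of K (K + n*reg*I)^{-1}; points in sparse regions tend to score higher."""
    n = K.shape[0]
    # K and (K + cI)^{-1} commute, so solving against K yields the same diagonal.
    return np.diag(np.linalg.solve(K + n * reg * np.eye(n), K))

# Toy data: a dense mode (95 points) and a rare mode (5 points).
rng = np.random.default_rng(0)
major = rng.normal(loc=0.0, scale=0.1, size=(95, 2))
minor = rng.normal(loc=5.0, scale=0.1, size=(5, 2))
P = np.vstack([major, minor])

sq = np.sum(P**2, axis=1)[:, None] + np.sum(P**2, axis=1)[None, :] - 2.0 * P @ P.T
K = np.exp(-np.maximum(sq, 0.0) / 2.0)           # Gaussian kernel, bandwidth 1

scores = ridge_leverage_scores(K, reg=0.01)
probs = scores / scores.sum()
sample = rng.choice(len(P), size=20, replace=False, p=probs)
```

In this toy setup the rare-mode points receive noticeably larger scores than the dense-mode points, so sampling with `probs` over-represents the minority mode relative to uniform sampling.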
Submitted 21 July, 2021; v1 submitted 6 April, 2021;
originally announced April 2021.
-
Unsupervised Energy-based Out-of-distribution Detection using Stiefel-Restricted Kernel Machine
Authors:
Francesco Tonin,
Arun Pandey,
Panagiotis Patrinos,
Johan A. K. Suykens
Abstract:
Detecting out-of-distribution (OOD) samples is an essential requirement for the deployment of machine learning systems in the real world. Until now, research on energy-based OOD detectors has focused on the softmax confidence score from a pre-trained neural network classifier with access to class labels. In contrast, we propose an unsupervised energy-based OOD detector leveraging the Stiefel-Restricted Kernel Machine (St-RKM). Training requires minimizing an objective function with an autoencoder loss term and the RKM energy where the interconnection matrix lies on the Stiefel manifold. Further, we outline multiple energy function definitions based on the RKM framework and discuss their utility. In the experiments on standard datasets, the proposed method improves over the existing energy-based OOD detectors and deep generative models. Through several ablation studies, we further illustrate the merit of each proposed energy function on the OOD detection performance.
Submitted 16 February, 2021;
originally announced February 2021.
-
Unsupervised learning of disentangled representations in deep restricted kernel machines with orthogonality constraints
Authors:
Francesco Tonin,
Panagiotis Patrinos,
Johan A. K. Suykens
Abstract:
We introduce Constr-DRKM, a deep kernel method for the unsupervised learning of disentangled data representations. We propose augmenting the original deep restricted kernel machine formulation for kernel PCA by orthogonality constraints on the latent variables to promote disentanglement and to make it possible to carry out optimization without first defining a stabilized objective. After illustrating an end-to-end training procedure based on a quadratic penalty optimization algorithm with warm start, we quantitatively evaluate the proposed method's effectiveness in disentangled feature learning. We demonstrate on four benchmark datasets that this approach performs similarly overall to $β$-VAE on a number of disentanglement metrics when few training points are available, while being less sensitive to randomness and hyperparameter selection than $β$-VAE. We also present a deterministic initialization of Constr-DRKM's training algorithm that significantly improves the reproducibility of the results. Finally, we empirically evaluate and discuss the role of the number of layers in the proposed methodology, examining the influence of each principal component in every layer and showing that components in lower layers act as local feature detectors capturing the broad trends of the data distribution, while components in deeper layers use the representation learned by previous layers and more accurately reproduce higher-level features.
Submitted 25 November, 2020;
originally announced November 2020.
-
Determinantal Point Processes Implicitly Regularize Semi-parametric Regression Problems
Authors:
Michaël Fanuel,
Joachim Schreurs,
Johan A. K. Suykens
Abstract:
Semi-parametric regression models are used in several applications which require comprehensibility without sacrificing accuracy. Typical examples are spline interpolation in geophysics, or non-linear time series problems, where the system includes a linear and non-linear component. We discuss here the use of a finite Determinantal Point Process (DPP) for approximating semi-parametric models. Recently, Barthelmé, Tremblay, Usevich, and Amblard introduced a novel representation of some finite DPPs. These authors formulated extended L-ensembles that can conveniently represent partial-projection DPPs and suggest their use for optimal interpolation. With the help of this formalism, we derive a key identity illustrating the implicit regularization effect of determinantal sampling for semi-parametric regression and interpolation. Also, a novel projected Nyström approximation is defined and used to derive a bound on the expected risk for the corresponding approximation of semi-parametric regression. This work naturally extends similar results obtained for kernel ridge regression.
Submitted 9 March, 2021; v1 submitted 13 November, 2020;
originally announced November 2020.
-
Towards a Unified Quadrature Framework for Large-Scale Kernel Machines
Authors:
Fanghui Liu,
Xiaolin Huang,
Yudong Chen,
Johan A. K. Suykens
Abstract:
In this paper, we develop a quadrature framework for large-scale kernel machines via a numerical integration representation. Considering that the integration domain and measure of typical kernels, e.g., Gaussian kernels, arc-cosine kernels, are fully symmetric, we leverage deterministic fully symmetric interpolatory rules to efficiently compute quadrature nodes and associated weights for kernel approximation. The developed interpolatory rules are able to reduce the number of needed nodes while retaining a high approximation accuracy. Further, we randomize the above deterministic rules by the classical Monte-Carlo sampling and control variates techniques with two merits: 1) The proposed stochastic rules make the dimension of the feature mapping flexible, so that we can control the discrepancy between the original and approximate kernels by tuning the dimension. 2) Our stochastic rules have nice statistical properties of unbiasedness and variance reduction with a fast convergence rate. In addition, we elucidate the relationship between our deterministic/stochastic interpolatory rules and current quadrature rules for kernel approximation, including the sparse grids quadrature and stochastic spherical-radial rules, thereby unifying these methods under our framework. Experimental results on several benchmark datasets show that our methods compare favorably with other representative kernel approximation based methods.
Submitted 10 June, 2021; v1 submitted 3 November, 2020;
originally announced November 2020.
-
Kernel regression in high dimensions: Refined analysis beyond double descent
Authors:
Fanghui Liu,
Zhenyu Liao,
Johan A. K. Suykens
Abstract:
In this paper, we provide a precise characterization of generalization properties of high dimensional kernel ridge regression across the under- and over-parameterized regimes, depending on whether the number of training data n exceeds the feature dimension d. By establishing a bias-variance decomposition of the expected excess risk, we show that, while the bias is (almost) independent of d and monotonically decreases with n, the variance depends on n, d and can be unimodal or monotonically decreasing under different regularization schemes. Our refined analysis goes beyond the double descent theory by showing that, depending on the data eigen-profile and the level of regularization, the kernel regression risk curve can be a double-descent-like, bell-shaped, or monotonic function of n. Experiments on synthetic and real data are conducted to support our theoretical findings.
Submitted 23 February, 2021; v1 submitted 6 October, 2020;
originally announced October 2020.
-
Outlier detection in non-elliptical data by kernel MRCD
Authors:
Joachim Schreurs,
Iwein Vranckx,
Mia Hubert,
Johan A. K. Suykens,
Peter J. Rousseeuw
Abstract:
The minimum regularized covariance determinant method (MRCD) is a robust estimator for multivariate location and scatter, which detects outliers by fitting a robust covariance matrix to the data. Its regularization ensures that the covariance matrix is well-conditioned in any dimension. The MRCD assumes that the non-outlying observations are roughly elliptically distributed, but many datasets are not of that form. Moreover, the computation time of MRCD increases substantially when the number of variables goes up, and nowadays datasets with many variables are common. The proposed Kernel Minimum Regularized Covariance Determinant (KMRCD) estimator addresses both issues. It is not restricted to elliptical data because it implicitly computes the MRCD estimates in a kernel induced feature space. A fast algorithm is constructed that starts from kernel-based initial estimates and exploits the kernel trick to speed up the subsequent computations. Based on the KMRCD estimates, a rule is proposed to flag outliers. The KMRCD algorithm performs well in simulations, and is illustrated on real-life data.
Submitted 29 March, 2021; v1 submitted 5 August, 2020;
originally announced August 2020.
-
A Theoretical Framework for Target Propagation
Authors:
Alexander Meulemans,
Francesco S. Carzaniga,
Johan A. K. Suykens,
João Sacramento,
Benjamin F. Grewe
Abstract:
The success of deep learning, a brain-inspired form of AI, has sparked interest in understanding how the brain could similarly learn across multiple layers of neurons. However, the majority of biologically-plausible learning algorithms have not yet reached the performance of backpropagation (BP), nor are they built on strong theoretical foundations. Here, we analyze target propagation (TP), a popular but not yet fully understood alternative to BP, from the standpoint of mathematical optimization. Our theory shows that TP is closely related to Gauss-Newton optimization and thus substantially differs from BP. Furthermore, our analysis reveals a fundamental limitation of difference target propagation (DTP), a well-known variant of TP, in the realistic scenario of non-invertible neural networks. We provide a first solution to this problem through a novel reconstruction loss that improves feedback weight training, while simultaneously introducing architectural flexibility by allowing for direct feedback connections from the output to each hidden layer. Our theory is corroborated by experimental results that show significant improvements in performance and in the alignment of forward weight updates with loss gradients, compared to DTP.
Submitted 16 December, 2020; v1 submitted 25 June, 2020;
originally announced June 2020.
-
Ensemble Kernel Methods, Implicit Regularization and Determinantal Point Processes
Authors:
Joachim Schreurs,
Michaël Fanuel,
Johan A. K. Suykens
Abstract:
By using the framework of Determinantal Point Processes (DPPs), some theoretical results concerning the interplay between diversity and regularization can be obtained. In this paper we show that sampling subsets with kDPPs results in implicit regularization in the context of ridgeless Kernel Regression. Furthermore, we leverage the common setup of state-of-the-art DPP algorithms to sample multiple small subsets and use them in an ensemble of ridgeless regressions. Our first empirical results indicate that an ensemble of ridgeless regressors can be useful for datasets containing redundant information.
Submitted 7 July, 2020; v1 submitted 24 June, 2020;
originally announced June 2020.
-
The Bures Metric for Generative Adversarial Networks
Authors:
Hannes De Meulemeester,
Joachim Schreurs,
Michaël Fanuel,
Bart De Moor,
Johan A. K. Suykens
Abstract:
Generative Adversarial Networks (GANs) are performant generative methods yielding high-quality samples. However, under certain circumstances, the training of GANs can lead to mode collapse or mode dropping, i.e. the generative models not being able to sample from the entire probability distribution. To address this problem, we use the last layer of the discriminator as a feature map to study the distribution of the real and the fake data. During training, we propose to match the real batch diversity to the fake batch diversity by using the Bures distance between covariance matrices in feature space. The computation of the Bures distance can be conveniently done in either feature space or kernel space in terms of the covariance and kernel matrix respectively. We observe that diversity matching reduces mode collapse substantially and has a positive effect on the sample quality. On the practical side, a very simple training procedure, that does not require additional hyperparameter tuning, is proposed and assessed on several datasets.
Submitted 27 April, 2021; v1 submitted 16 June, 2020;
originally announced June 2020.
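As an illustration of the Bures distance between covariance matrices mentioned in the abstract above, the following is a minimal NumPy sketch, not the authors' training procedure. It uses the standard formula d_B(A, B)^2 = tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2}). The helper names (`psd_sqrt`, `bures_distance_sq`) and the random "feature batches" are assumptions made for illustration only.

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric square root of a positive semi-definite matrix via eigh."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)  # symmetrize against round-off
    w = np.clip(w, 0.0, None)               # drop tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T

def bures_distance_sq(A, B):
    """Squared Bures distance between two PSD matrices A and B."""
    root_A = psd_sqrt(A)
    cross = psd_sqrt(root_A @ B @ root_A)
    return np.trace(A) + np.trace(B) - 2.0 * np.trace(cross)

# Hypothetical feature batches: a "real" batch and a low-diversity "fake" batch.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(256, 8))
fake_feats = 0.3 * rng.normal(size=(256, 8))

C_real = np.cov(real_feats, rowvar=False)
C_fake = np.cov(fake_feats, rowvar=False)

d_same = bures_distance_sq(C_real, C_real)  # close to zero
d_diff = bures_distance_sq(C_real, C_fake)  # clearly positive
```

In this toy setting the distance is near zero when the two covariances coincide and grows when one batch has much less spread, which is the kind of diversity mismatch the abstract proposes to penalize.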
-
Disentangled Representation Learning and Generation with Manifold Optimization
Authors:
Arun Pandey,
Michael Fanuel,
Joachim Schreurs,
Johan A. K. Suykens
Abstract:
Disentanglement is a useful property in representation learning which increases the interpretability of generative models such as Variational Autoencoders (VAEs), Generative Adversarial Models, and their many variants. Typically in such models, an increase in disentanglement performance is traded off against generation quality. In the context of latent space models, this work presents a representation learning framework that explicitly promotes disentanglement by encouraging orthogonal directions of variations. The proposed objective is the sum of an autoencoder error term along with a Principal Component Analysis reconstruction error in the feature space. This can be interpreted as a Restricted Kernel Machine with the eigenvector matrix valued on the Stiefel manifold. Our analysis shows that such a construction promotes disentanglement by matching the principal directions in the latent space with the directions of orthogonal variation in data space. In an alternating minimization scheme, we use the Cayley ADAM algorithm, a stochastic optimization method on the Stiefel manifold, along with the ADAM optimizer. Our theoretical discussion and various experiments show that the proposed model improves over many VAE variants in terms of both generation quality and disentangled representation learning.
Submitted 30 May, 2022; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Analysis of Regularized Least Squares in Reproducing Kernel Krein Spaces
Authors:
Fanghui Liu,
Lei Shi,
Xiaolin Huang,
Jie Yang,
Johan A. K. Suykens
Abstract:
In this paper, we study the asymptotic properties of regularized least squares with indefinite kernels in reproducing kernel Krein spaces (RKKS). By introducing a bounded hyper-sphere constraint to this non-convex regularized risk minimization problem, we theoretically demonstrate that the problem has a globally optimal solution with a closed form on the sphere, which makes approximation analysis feasible in RKKS. For the original regularizer induced by the indefinite inner product, we modify traditional error decomposition techniques, prove convergence results for the introduced hypothesis error based on matrix perturbation theory, and derive learning rates for this regularized regression problem in RKKS. Under some conditions, the derived learning rates in RKKS are the same as those in reproducing kernel Hilbert spaces (RKHS). This is the first work on approximation analysis of regularized learning algorithms in RKKS.
Submitted 24 November, 2020; v1 submitted 1 June, 2020;
originally announced June 2020.
-
Fast Learning in Reproducing Kernel Krein Spaces via Signed Measures
Authors:
Fanghui Liu,
Xiaolin Huang,
Yingyi Chen,
Johan A. K. Suykens
Abstract:
In this paper, we attempt to solve a long-standing open question for non-positive definite (non-PD) kernels in the machine learning community: can a given non-PD kernel be decomposed into the difference of two PD kernels (termed positive decomposition)? We take a distributional view of this question by introducing the signed measure, which transforms positive decomposition into measure decomposition: a series of non-PD kernels can be associated with the linear combination of specific finite Borel measures. In this manner, our distribution-based framework provides a necessary and sufficient condition to answer this open question. Moreover, this solution is computationally implementable in practice and scales non-PD kernels to large-sample cases, which allows us to devise the first random features algorithm to obtain an unbiased estimator. Experimental results on several benchmark datasets verify the effectiveness of our algorithm over the existing methods.
Submitted 9 February, 2021; v1 submitted 30 May, 2020;
originally announced June 2020.
-
Random Features for Kernel Approximation: A Survey on Algorithms, Theory, and Beyond
Authors:
Fanghui Liu,
Xiaolin Huang,
Yudong Chen,
Johan A. K. Suykens
Abstract:
Random features is one of the most popular techniques to speed up kernel methods in large-scale problems. Related works have been recognized by the NeurIPS Test-of-Time award in 2017 and the ICML Best Paper Finalist in 2019. The body of work on random features has grown rapidly, and hence it is desirable to have a comprehensive overview on this topic explaining the connections among various algorithms and theoretical results. In this survey, we systematically review the work on random features from the past ten years. First, the motivations, characteristics and contributions of representative random features based algorithms are summarized according to their sampling schemes, learning procedures, variance reduction properties and how they exploit training data. Second, we review theoretical results that center around the following key question: how many random features are needed to ensure a high approximation quality or no loss in the empirical/expected risks of the learned estimator. Third, we provide a comprehensive evaluation of popular random features based algorithms on several large-scale benchmark datasets and discuss their approximation quality and prediction performance for classification. Last, we discuss the relationship between random features and modern over-parameterized deep neural networks (DNNs), including the use of high dimensional random features in the analysis of DNNs as well as the gaps between current theoretical and empirical results. This survey may serve as a gentle introduction to this topic, and as a users' guide for practitioners interested in applying the representative algorithms and understanding theoretical results under various technical assumptions. We hope that this survey will facilitate discussion on the open problems in this topic, and more importantly, shed light on future research directions.
Submitted 11 July, 2021; v1 submitted 23 April, 2020;
originally announced April 2020.
-
Diversity sampling is an implicit regularization for kernel methods
Authors:
Michaël Fanuel,
Joachim Schreurs,
Johan A. K. Suykens
Abstract:
Kernel methods have achieved very good performance on large scale regression and classification problems, by using the Nyström method and preconditioning techniques. The Nyström approximation -- based on a subset of landmarks -- gives a low rank approximation of the kernel matrix, and is known to provide a form of implicit regularization. We further elaborate on the impact of sampling diverse landmarks for constructing the Nyström approximation in supervised as well as unsupervised kernel methods. By using Determinantal Point Processes for sampling, we obtain additional theoretical results concerning the interplay between diversity and regularization. Empirically, we demonstrate the advantages of training kernel methods based on subsets made of diverse points. In particular, if the dataset has a dense bulk and a sparser tail, we show that Nyström kernel regression with diverse landmarks increases the accuracy of the regression in sparser regions of the dataset, with respect to a uniform landmark sampling. A greedy heuristic is also proposed to select diverse samples of significant size within large datasets when exact DPP sampling is not practically feasible.
Submitted 20 February, 2020;
originally announced February 2020.
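To make the Nyström approximation from the abstract above concrete, here is a minimal NumPy sketch under stated assumptions: landmarks are drawn uniformly at random (the baseline the abstract compares against, not DPP sampling), the kernel is Gaussian, and all function names and parameter values are hypothetical. The low-rank approximation is K ≈ K_nm K_mm^+ K_mn, where m indexes the landmarks.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def nystrom(X, landmark_idx, sigma, rel_tol=1e-10):
    """Nystrom approximation K_nm K_mm^+ K_mn built from the given landmarks."""
    L = X[landmark_idx]
    K_nm = gaussian_kernel(X, L, sigma)
    K_mm = gaussian_kernel(L, L, sigma)
    # Pseudo-inverse via eigendecomposition, discarding near-zero eigenvalues.
    w, V = np.linalg.eigh(K_mm)
    keep = w > rel_tol * w.max()
    K_mm_pinv = (V[:, keep] / w[keep]) @ V[:, keep].T
    return K_nm @ K_mm_pinv @ K_nm.T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
idx = rng.choice(200, size=40, replace=False)  # uniform landmark sampling

K = gaussian_kernel(X, X, sigma=1.0)
K_hat = nystrom(X, idx, sigma=1.0)
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

Replacing the uniform `idx` with a more diverse set of landmarks is the design choice the abstract studies; this sketch only provides the baseline against which such a choice could be compared.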
-
Wasserstein Exponential Kernels
Authors:
Henri De Plaen,
Michaël Fanuel,
Johan A. K. Suykens
Abstract:
In the context of kernel methods, the similarity between data points is encoded by the kernel function, which is often defined in terms of the Euclidean distance, a common example being the squared exponential kernel. Recently, other distances relying on optimal transport theory, such as the Wasserstein distance between probability distributions, have shown their practical relevance for different machine learning techniques. In this paper, we study the use of exponential kernels defined in terms of the regularized Wasserstein distance and discuss their positive definiteness. More specifically, we define Wasserstein feature maps and illustrate their usefulness for supervised learning problems involving shapes and images. Empirically, Wasserstein squared exponential kernels are shown to yield smaller classification errors on small training sets of shapes, compared to analogous classifiers using Euclidean distances.
Submitted 5 February, 2020;
originally announced February 2020.
-
Robust Generative Restricted Kernel Machines using Weighted Conjugate Feature Duality
Authors:
Arun Pandey,
Joachim Schreurs,
Johan A. K. Suykens
Abstract:
Interest in generative models has grown tremendously in the past decade. However, their training performance can be adversely affected by contamination, where outliers are encoded in the representation of the model. This results in the generation of noisy data. In this paper, we introduce weighted conjugate feature duality in the framework of Restricted Kernel Machines (RKMs). The RKM formulation allows for an easy integration of methods from classical robust statistics. This formulation is used to fine-tune the latent space of generative RKMs using a weighting function based on the Minimum Covariance Determinant, which is a highly robust estimator of multivariate location and scatter. Experiments show that the weighted RKM is capable of generating clean images when contamination is present in the training data. We further show that the robust method also preserves uncorrelated feature learning through qualitative and quantitative experiments on standard datasets.
Submitted 23 June, 2020; v1 submitted 4 February, 2020;
originally announced February 2020.
-
Random Fourier Features via Fast Surrogate Leverage Weighted Sampling
Authors:
Fanghui Liu,
Xiaolin Huang,
Yudong Chen,
Jie Yang,
Johan A. K. Suykens
Abstract:
In this paper, we propose a fast surrogate leverage weighted sampling strategy to generate refined random Fourier features for kernel approximation. Compared to the current state-of-the-art method that uses the leverage weighted scheme [Li-ICML2019], our new strategy is simpler and more effective. It uses kernel alignment to guide the sampling process and avoids the matrix inversion when computing the leverage function. Given n observations and s random features, our strategy reduces the time complexity from O(ns^2+s^3) to O(ns^2), while achieving comparable (or even slightly better) prediction performance when applied to kernel ridge regression (KRR). In addition, we provide theoretical guarantees on the generalization performance of our approach, and in particular characterize the number of random features required to achieve statistical guarantees in KRR. Experiments on several benchmark datasets demonstrate that our algorithm achieves comparable prediction performance at a lower time cost than [Li-ICML2019].
Submitted 20 November, 2019;
originally announced November 2019.
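For readers unfamiliar with random Fourier features, the sketch below shows the plain, unweighted construction for a Gaussian kernel that the abstract above refines. It is not the surrogate leverage weighted scheme of the paper: frequencies are sampled directly from the kernel's spectral density, and the function names, bandwidth, and feature count are assumptions chosen for illustration.

```python
import numpy as np

def rff_features(X, num_features, sigma, rng):
    """Map X (n, d) to s random Fourier features for a Gaussian kernel."""
    d = X.shape[1]
    # Frequencies from the Gaussian kernel's spectral density N(0, I / sigma^2).
    W = rng.normal(scale=1.0 / sigma, size=(d, num_features))
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def gaussian_kernel(X, sigma):
    """Exact Gaussian kernel matrix exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

Z = rff_features(X, num_features=5000, sigma=1.5, rng=rng)
K_exact = gaussian_kernel(X, sigma=1.5)
K_approx = Z @ Z.T  # inner products of features approximate the kernel
max_err = np.max(np.abs(K_exact - K_approx))
```

With the feature matrix `Z` in hand, kernel ridge regression can be replaced by linear ridge regression on `Z`, whose cost is dominated by an O(ns^2) product for n observations and s features; weighting schemes such as the one in the abstract change how the frequencies are sampled, not this overall pipeline.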