-
Regularizing cross entropy loss via minimum entropy and K-L divergence
Authors:
Abdulrahman Oladipupo Ibraheem
Abstract:
I introduce two novel loss functions for classification in deep learning. Both extend the standard cross-entropy loss by regularizing it with minimum entropy and Kullback-Leibler (K-L) divergence terms. The first is termed mixed entropy loss (MIX-ENT for short), and the second is termed minimum entropy regularized cross-entropy loss (MIN-ENT for short). The MIX-ENT function introduces a regularizer that can be shown to be equivalent to the sum of a minimum entropy term and a K-L divergence term. Note, however, that this K-L divergence term differs from the one in the standard cross-entropy loss function in that it swaps the roles of the target probability and the hypothesis probability. The MIN-ENT function simply adds a minimum entropy regularizer to the standard cross-entropy loss function. In both MIX-ENT and MIN-ENT, the minimum entropy regularizer minimizes the entropy of the hypothesis probability distribution output by the neural network. Experiments on the EMNIST-Letters dataset show that my implementation of MIX-ENT and MIN-ENT lets the VGG model climb from its previous 3rd position on the paperswithcode leaderboard to the 2nd position, outperforming the Spinal-VGG model in so doing. Specifically, using standard cross-entropy, VGG achieves 95.86% and Spinal-VGG achieves 95.88% classification accuracy, whereas VGG (without Spinal-VGG) achieves 95.933% with MIN-ENT and 95.927% with MIX-ENT. The pre-trained models for both MIX-ENT and MIN-ENT are available at https://github.com/rahmanoladi/minimum entropy project.
Submitted 23 January, 2025;
originally announced January 2025.
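The two regularizers described in the abstract above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the author's implementation: the weight `lam`, the helper names, and the smoothed target `p_smooth` are all hypothetical. The second function shows one reading of the MIX-ENT regularizer, using the identity H(q) + KL(q || p) = -sum_i q_i log p_i, which requires a target p with no zero entries (hence the smoothing assumption).

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(q):
    # Shannon entropy H(q) = -sum_i q_i log q_i (terms with q_i = 0 contribute 0).
    return -sum(qi * math.log(qi) for qi in q if qi > 0.0)

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i log q_i; requires q_i > 0 wherever p_i > 0.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl(a, b):
    # KL(a || b) = sum_i a_i log(a_i / b_i); requires b_i > 0 wherever a_i > 0.
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

def min_ent_loss(p, logits, lam=0.1):
    # Assumed form: standard cross-entropy plus lam times the entropy of the
    # network's output distribution q. The weight lam is hypothetical.
    q = softmax(logits)
    return cross_entropy(p, q) + lam * entropy(q)

def reversed_regularizer(p_smooth, q):
    # Minimum entropy term plus a K-L term with the roles of target and
    # hypothesis swapped. p_smooth is an assumed smoothed (strictly positive)
    # target, since a one-hot target would make KL(q || p) infinite.
    return entropy(q) + kl(q, p_smooth)
```

With a one-hot target `p = [1.0, 0.0, 0.0]` and logits `[2.0, 0.5, -1.0]`, `min_ent_loss(p, logits, lam=0.0)` reduces to the plain cross-entropy, and any positive `lam` adds a penalty that grows as the predicted distribution becomes less peaked.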
-
Compact Shape Trees: A Contribution to the Forest of Shape Correspondences and Matching Methods
Authors:
Abdulrahman Oladipupo Ibraheem
Abstract:
We propose a novel technique, termed compact shape trees, for computing correspondences of single-boundary 2-D shapes in O(n^2) time. Together with zero or more features defined at each of n sample points on the shape's boundary, the compact shape tree of a shape comprises the O(n) collection of vectors emanating from any one of the sample points on the shape's boundary to the rest of the sample points on the boundary. As it turns out, compact shape trees have a number of elegant properties in both the spatial and frequency domains. In particular, via a simple vector-algebraic argument, we show that the O(n) collection of vectors in a compact shape tree possesses at least the same discriminatory power as the O(n^2) collection of lines emanating from each sample point to every other sample point on a shape's boundary. In addition, we describe neat approaches for achieving scale and rotation invariance with compact shape trees in the spatial domain; by viewing compact shape trees as aperiodic discrete signals, we also prove scale and rotation invariance properties for them in the Fourier domain. To this end, using concepts from differential geometry and calculus, we propose a novel theory for sampling 2-D shape boundaries in a scale and rotation invariant manner. Finally, we propose a number of shape recognition experiments to test the efficacy of our concept.
Submitted 9 June, 2015;
originally announced June 2015.
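The claim that O(n) vectors carry as much information as the O(n^2) pairwise collection can be illustrated with a small sketch. It assumes one plausible form of the vector-algebraic argument, namely that the vector from point i to point j equals the difference of their two root-relative vectors; the paper's own proof may differ. The function names are hypothetical, and for indexing convenience the tree below also stores the root's zero vector.

```python
def compact_shape_tree(points, root=0):
    # Vectors from the chosen root sample point to every sample point.
    # Entry `root` is the zero vector, kept so indices line up with `points`.
    rx, ry = points[root]
    return [(x - rx, y - ry) for x, y in points]

def pairwise_vector(tree, i, j):
    # Recover the vector from sample point i to sample point j using only
    # the O(n) tree: (p_j - p_root) - (p_i - p_root) = p_j - p_i.
    return (tree[j][0] - tree[i][0], tree[j][1] - tree[i][1])
```

Because every pairwise vector is recoverable this way, nothing about the relative geometry of the sample points is lost by storing only the n root-relative vectors.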
-
Bi-directional Shape Correspondences (BSC): A Novel Technique for 2-d Shape Warping in Quadratic Time?
Authors:
Abdulrahman Oladipupo Ibraheem
Abstract:
We propose Bidirectional Shape Correspondence (BSC) as a possible improvement on the famous shape contexts (SC) framework. Our proposals derive from the observation that the SC framework enforces a one-to-one correspondence between sample points, and that this leads to two possible drawbacks. First, it denies the framework the opportunity to effect advantageous many-to-many matching between points on the two shapes being compared. Second, it calls for the Hungarian algorithm, which unfortunately requires cubic time. While the dynamic-space-warping dynamic programming algorithm has provided a standard solution to the first problem, it demands quintic time for general multi-contour shapes, and w times quadratic time for the special case of single-contour shapes, even after a heuristic search window of width w has been chosen. Therefore, in this work, we propose a simple method for computing "many-to-many" correspondences for the class of all 2-d shapes in quadratic time. Our approach is to explicitly let each point on the first shape choose a best match on the second shape, and vice versa. Along the way, we also propose the use of data-clustering techniques for dealing with the outlier problem, and, from another viewpoint, it turns out that this clustering can be seen as an autonomous, rather than pre-computed, sampling of the shape boundary.
Submitted 21 December, 2014;
originally announced December 2014.
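The bidirectional matching idea in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes plain Euclidean distance between point coordinates as the matching cost, whereas the actual framework would presumably compare shape descriptors. Both function names are hypothetical.

```python
import math

def best_matches(src, dst):
    # For each point in src, the index of its lowest-cost point in dst.
    # Cost here is Euclidean distance (an assumption for illustration).
    return [min(range(len(dst)), key=lambda j: math.dist(p, dst[j])) for p in src]

def bidirectional_correspondences(a, b):
    # Each point on shape a picks a best match on b, and vice versa.
    # The union of both choices may be many-to-many. Total work is
    # O(len(a) * len(b)), i.e. quadratic, with no assignment solver.
    forward = best_matches(a, b)
    backward = best_matches(b, a)
    pairs = {(i, j) for i, j in enumerate(forward)}
    pairs |= {(i, j) for j, i in enumerate(backward)}
    return sorted(pairs)
```

For example, with `a = [(0, 0), (10, 0)]` and `b = [(0, 1), (1, 1), (10, 1)]`, the first point of `a` ends up paired with two points of `b`, which a one-to-one assignment would not allow.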
-
Correlation of Data Reconstruction Error and Shrinkages in Pair-wise Distances under Principal Component Analysis (PCA)
Authors:
Abdulrahman Oladipupo Ibraheem
Abstract:
In this ongoing work, I explore certain theoretical and empirical implications of data transformations under PCA. In particular, I state and prove three theorems about PCA, which I paraphrase as follows: 1) PCA without discarding eigenvector rows is injective, but loses this injectivity when eigenvector rows are discarded. 2) PCA without discarding eigenvector rows preserves pair-wise distances, but tends to cause pair-wise distances to shrink when eigenvector rows are discarded. 3) For any pair of points, the shrinkage in pair-wise distance is bounded above by an L1 norm reconstruction error associated with the points. Clearly, 3) suggests that there might exist some correlation between shrinkages in pair-wise distances and the mean square reconstruction error, which is defined as the sum of those eigenvalues associated with the discarded eigenvectors. I therefore decided to perform numerical experiments to obtain the correlation between the sum of those eigenvalues and shrinkages in pair-wise distances. In addition, I have also performed some experiments to check, respectively, the effect of the sum of those eigenvalues and the effect of the shrinkages on classification accuracies under the PCA map. So far, I have obtained the following results on some publicly available data from the UCI Machine Learning Repository: 1) There seems to be a strong correlation between the sum of those eigenvalues associated with discarded eigenvectors and shrinkages in pair-wise distances. 2) Neither the sum of those eigenvalues nor pair-wise distances have any strong correlations with classification accuracies.
Submitted 21 December, 2014;
originally announced December 2014.
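The second theorem paraphrased above (distances are preserved by the full PCA map and can only shrink once eigenvector rows are discarded) can be checked numerically on a toy 2-D example. The sketch below is an assumption-laden illustration, not the author's experiment: it uses the closed-form principal-axis angle for a 2x2 covariance matrix rather than a general eigen-solver, and the function names are hypothetical.

```python
import math

def pca_rows_2d(points):
    # Mean and orthonormal eigenvector rows of the 2x2 covariance matrix.
    # The principal axis angle is 0.5 * atan2(2*sxy, sxx - syy).
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    return (mx, my), [(c, s), (-s, c)]

def project(point, mean, rows):
    # Center the point, then take its coordinates along the kept rows.
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return tuple(r[0] * dx + r[1] * dy for r in rows)

def shrinkage(p, q, mean, rows):
    # Original pair-wise distance minus the distance after projection.
    return math.dist(p, q) - math.dist(project(p, mean, rows), project(q, mean, rows))
```

Calling `shrinkage` with both rows should return a value that is zero up to rounding, while passing only `rows[:1]` (the second eigenvector row discarded) should return a non-negative value.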
-
SENNS: Sparse Extraction Neural NetworkS for Feature Extraction
Authors:
Abdulrahman Oladipupo Ibraheem
Abstract:
By drawing on ideas from optimisation theory, artificial neural networks (ANN), graph embeddings and sparse representations, I develop a novel technique, termed SENNS (Sparse Extraction Neural NetworkS), aimed at addressing the feature extraction problem. The proposed method uses (preferably deep) ANNs for projecting input attribute vectors to an output space wherein pair-wise distances are maximized for vectors belonging to different classes, but minimized for those belonging to the same class, while simultaneously enforcing sparsity on the ANN outputs. The vectors that result from the projection can then be used as features in any classifier of choice. Mathematically, I formulate the proposed method as the minimisation of an objective function which can be interpreted, in the ANN output space, as a negative factor of the sum of the squares of the pair-wise distances between output vectors belonging to different classes, added to a positive factor of the sum of the squares of the pair-wise distances between output vectors belonging to the same class, plus sparsity and weight decay terms. To derive an algorithm for minimizing the objective function via gradient descent, I use the multi-variate version of the chain rule to obtain the partial derivatives of the function with respect to ANN weights and biases, and find that each of the required partial derivatives can be expressed as a sum of six terms. As it turns out, four of those six terms can be computed using the standard backpropagation algorithm; the fifth can be computed via a slight modification of the standard backpropagation algorithm; while the sixth one can be computed via simple arithmetic. Finally, I propose experiments on the ARABASE Arabic corpora of digits and letters, the CMU PIE database of faces, the MNIST digits database, and other standard machine learning databases.
Submitted 21 December, 2014;
originally announced December 2014.
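The objective described in the abstract above can be written out as a short sketch that evaluates it for a batch of network outputs. Several details are assumptions made only for illustration: the sparsity term is taken to be an L1 penalty on the outputs, weight decay is taken to be the sum of squared weights, and the coefficient names `alpha`, `beta`, `gamma` and `delta` are hypothetical. The gradient computation via backpropagation is not shown.

```python
def senns_objective(outputs, labels, weights,
                    alpha=1.0, beta=1.0, gamma=0.01, delta=0.001):
    # outputs: list of ANN output vectors; labels: class label per vector;
    # weights: flat list of network weights (used only for weight decay).
    between = 0.0  # squared distances between outputs of different classes
    within = 0.0   # squared distances between outputs of the same class
    n = len(outputs)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(outputs[i], outputs[j]))
            if labels[i] == labels[j]:
                within += d2
            else:
                between += d2
    sparsity = sum(abs(v) for out in outputs for v in out)  # assumed L1 form
    decay = sum(w * w for w in weights)                     # assumed L2 form
    # Negative factor on between-class spread, positive factor on within-class
    # spread, plus the sparsity and weight decay penalties.
    return -alpha * between + beta * within + gamma * sparsity + delta * decay
```

Minimizing this quantity pushes outputs of different classes apart and pulls outputs of the same class together, while the two penalty terms discourage dense outputs and large weights.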