-
Efficiently Disentangle Causal Representations
Authors:
Yuanpeng Li,
Joel Hestness,
Mohamed Elhoseiny,
Liang Zhao,
Kenneth Church
Abstract:
This paper proposes an efficient approach to learning disentangled representations with causal mechanisms based on the difference of conditional probabilities in original and new distributions. We approximate the difference with models' generalization abilities so that it fits in the standard machine learning framework and can be efficiently computed. In contrast to the state-of-the-art approach,…
▽ More
This paper proposes an efficient approach to learning disentangled representations with causal mechanisms based on the difference of conditional probabilities in original and new distributions. We approximate the difference with models' generalization abilities so that it fits in the standard machine learning framework and can be efficiently computed. In contrast to the state-of-the-art approach, which relies on the learner's adaptation speed to new distribution, the proposed approach only requires evaluating the model's generalization ability. We provide a theoretical explanation for the advantage of the proposed method, and our experiments show that the proposed technique is 1.9--11.0$\times$ more sample efficient and 9.4--32.4 times quicker than the previous method on various tasks. The source code is available at \url{https://github.com/yuanpeng16/EDCR}.
△ Less
Submitted 1 January, 2024; v1 submitted 6 January, 2022;
originally announced January 2022.
-
Temporal Positive-unlabeled Learning for Biomedical Hypothesis Generation via Risk Estimation
Authors:
Uchenna Akujuobi,
Jun Chen,
Mohamed Elhoseiny,
Michael Spranger,
Xiangliang Zhang
Abstract:
Understanding the relationships between biomedical terms like viruses, drugs, and symptoms is essential in the fight against diseases. Many attempts have been made to introduce the use of machine learning to the scientific process of hypothesis generation(HG), which refers to the discovery of meaningful implicit connections between biomedical terms. However, most existing methods fail to truly cap…
▽ More
Understanding the relationships between biomedical terms like viruses, drugs, and symptoms is essential in the fight against diseases. Many attempts have been made to introduce the use of machine learning to the scientific process of hypothesis generation(HG), which refers to the discovery of meaningful implicit connections between biomedical terms. However, most existing methods fail to truly capture the temporal dynamics of scientific term relations and also assume unobserved connections to be irrelevant (i.e., in a positive-negative (PN) learning setting). To break these limits, we formulate this HG problem as future connectivity prediction task on a dynamic attributed graph via positive-unlabeled (PU) learning. Then, the key is to capture the temporal evolution of node pair (term pair) relations from just the positive and unlabeled data. We propose a variational inference model to estimate the positive prior, and incorporate it in the learning of node pair embeddings, which are then used for link prediction. Experiment results on real-world biomedical term relationship datasets and case study analyses on a COVID-19 dataset validate the effectiveness of the proposed model.
△ Less
Submitted 5 October, 2020;
originally announced October 2020.
-
Class Normalization for (Continual)? Generalized Zero-Shot Learning
Authors:
Ivan Skorokhodov,
Mohamed Elhoseiny
Abstract:
Normalization techniques have proved to be a crucial ingredient of successful training in a traditional supervised learning regime. However, in the zero-shot learning (ZSL) world, these ideas have received only marginal attention. This work studies normalization in ZSL scenario from both theoretical and practical perspectives. First, we give a theoretical explanation to two popular tricks used in…
▽ More
Normalization techniques have proved to be a crucial ingredient of successful training in a traditional supervised learning regime. However, in the zero-shot learning (ZSL) world, these ideas have received only marginal attention. This work studies normalization in ZSL scenario from both theoretical and practical perspectives. First, we give a theoretical explanation to two popular tricks used in zero-shot learning: normalize+scale and attributes normalization and show that they help training by preserving variance during a forward pass. Next, we demonstrate that they are insufficient to normalize a deep ZSL model and propose Class Normalization (CN): a normalization scheme, which alleviates this issue both provably and in practice. Third, we show that ZSL models typically have more irregular loss surface compared to traditional classifiers and that the proposed method partially remedies this problem. Then, we test our approach on 4 standard ZSL datasets and outperform sophisticated modern SotA with a simple MLP optimized without any bells and whistles and having ~50 times faster training speed. Finally, we generalize ZSL to a broader problem -- continual ZSL, and introduce some principled metrics and rigorous baselines for this new setup. The project page is located at https://universome.github.io/class-norm.
△ Less
Submitted 14 April, 2021; v1 submitted 19 June, 2020;
originally announced June 2020.
-
Inner Ensemble Networks: Average Ensemble as an Effective Regularizer
Authors:
Abduallah Mohamed,
Muhammed Mohaimin Sadiq,
Ehab AlBadawy,
Mohamed Elhoseiny,
Christian Claudel
Abstract:
We introduce Inner Ensemble Networks (IENs) which reduce the variance within the neural network itself without an increase in the model complexity. IENs utilize ensemble parameters during the training phase to reduce the network variance. While in the testing phase, these parameters are removed without a change in the enhanced performance. IENs reduce the variance of an ordinary deep model by a fa…
▽ More
We introduce Inner Ensemble Networks (IENs) which reduce the variance within the neural network itself without an increase in the model complexity. IENs utilize ensemble parameters during the training phase to reduce the network variance. While in the testing phase, these parameters are removed without a change in the enhanced performance. IENs reduce the variance of an ordinary deep model by a factor of $1/m^{L-1}$, where $m$ is the number of inner ensembles and $L$ is the depth of the model. Also, we show empirically and theoretically that IENs lead to a greater variance reduction in comparison with other similar approaches such as dropout and maxout. Our results show a decrease of error rates between 1.7\% and 17.3\% in comparison with an ordinary deep model. We also show that IEN was preferred by Neural Architecture Search (NAS) methods over prior approaches. Code is available at https://github.com/abduallahmohamed/inner_ensemble_nets.
△ Less
Submitted 9 October, 2020; v1 submitted 15 June, 2020;
originally announced June 2020.
-
Exploring Long Tail Visual Relationship Recognition with Large Vocabulary
Authors:
Sherif Abdelkarim,
Aniket Agarwal,
Panos Achlioptas,
Jun Chen,
Jiaji Huang,
Boyang Li,
Kenneth Church,
Mohamed Elhoseiny
Abstract:
Several approaches have been proposed in recent literature to alleviate the long-tail problem, mainly in object classification tasks. In this paper, we make the first large-scale study concerning the task of Long-Tail Visual Relationship Recognition (LTVRR). LTVRR aims at improving the learning of structured visual relationships that come from the long-tail (e.g., "rabbit grazing on grass"). In th…
▽ More
Several approaches have been proposed in recent literature to alleviate the long-tail problem, mainly in object classification tasks. In this paper, we make the first large-scale study concerning the task of Long-Tail Visual Relationship Recognition (LTVRR). LTVRR aims at improving the learning of structured visual relationships that come from the long-tail (e.g., "rabbit grazing on grass"). In this setup, the subject, relation, and object classes each follow a long-tail distribution. To begin our study and make a future benchmark for the community, we introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets. We use these benchmarks to study the performance of several state-of-the-art long-tail models on the LTVRR setup. Lastly, we propose a visiolinguistic hubless (VilHub) loss and a Mixup augmentation technique adapted to LTVRR setup, dubbed as RelMix. Both VilHub and RelMix can be easily integrated on top of existing models and despite being simple, our results show that they can remarkably improve the performance, especially on tail classes. Benchmarks, code, and models have been made available at: https://github.com/Vision-CAIR/LTVRR.
△ Less
Submitted 25 September, 2021; v1 submitted 25 March, 2020;
originally announced April 2020.
-
Uncertainty-guided Continual Learning with Bayesian Neural Networks
Authors:
Sayna Ebrahimi,
Mohamed Elhoseiny,
Trevor Darrell,
Marcus Rohrbach
Abstract:
Continual learning aims to learn new tasks without forgetting previously learned ones. This is especially challenging when one cannot access data from previous tasks and when the model has a fixed capacity. Current regularization-based continual learning algorithms need an external representation and extra computation to measure the parameters' \textit{importance}. In contrast, we propose Uncertai…
▽ More
Continual learning aims to learn new tasks without forgetting previously learned ones. This is especially challenging when one cannot access data from previous tasks and when the model has a fixed capacity. Current regularization-based continual learning algorithms need an external representation and extra computation to measure the parameters' \textit{importance}. In contrast, we propose Uncertainty-guided Continual Bayesian Neural Networks (UCB), where the learning rate adapts according to the uncertainty defined in the probability distribution of the weights in networks. Uncertainty is a natural way to identify \textit{what to remember} and \textit{what to change} as we continually learn, and thus mitigate catastrophic forgetting. We also show a variant of our model, which uses uncertainty for weight pruning and retains task performance after pruning by saving binary masks per tasks. We evaluate our UCB approach extensively on diverse object classification datasets with short and long sequences of tasks and report superior or on-par performance compared to existing approaches. Additionally, we show that our model does not necessarily need task information at test time, i.e. it does not presume knowledge of which task a sample belongs to.
△ Less
Submitted 19 February, 2020; v1 submitted 6 June, 2019;
originally announced June 2019.
-
Semi-Supervised Few-Shot Learning with Prototypical Random Walks
Authors:
Ahmed Ayyad,
Yuchen Li,
Nassir Navab,
Shadi Albarqouni,
Mohamed Elhoseiny
Abstract:
Recent progress has shown that few-shot learning can be improved with access to unlabelled data, known as semi-supervised few-shot learning(SS-FSL). We introduce an SS-FSL approach, dubbed as Prototypical Random Walk Networks(PRWN), built on top of Prototypical Networks (PN). We develop a random walk semi-supervised loss that enables the network to learn representations that are compact and well-s…
▽ More
Recent progress has shown that few-shot learning can be improved with access to unlabelled data, known as semi-supervised few-shot learning(SS-FSL). We introduce an SS-FSL approach, dubbed as Prototypical Random Walk Networks(PRWN), built on top of Prototypical Networks (PN). We develop a random walk semi-supervised loss that enables the network to learn representations that are compact and well-separated. Our work is related to the very recent development of graph-based approaches for few-shot learning. However, we show that compact and well-separated class representations can be achieved by modeling our prototypical random walk notion without needing additional graph-NN parameters or requiring a transductive setting where a collective test set is provided. Our model outperforms baselines in most benchmarks with significant improvements in some cases. Our model, trained with 40$\%$ of the data as labeled, compares competitively against fully supervised prototypical networks, trained on 100$\%$ of the labels, even outperforming it in the 1-shot mini-Imagenet case with 50.89$\%$ to 49.4$\%$ accuracy. We also show that our loss is resistant to distractors, unlabeled data that does not belong to any of the training classes, and hence reflecting robustness to labeled/unlabeled class distribution mismatch. Associated GitHub page can be found at https://prototypical-random-walk.github.io.
△ Less
Submitted 9 February, 2021; v1 submitted 5 March, 2019;
originally announced March 2019.
-
On Tiny Episodic Memories in Continual Learning
Authors:
Arslan Chaudhry,
Marcus Rohrbach,
Mohamed Elhoseiny,
Thalaiyasingam Ajanthan,
Puneet K. Dokania,
Philip H. S. Torr,
Marc'Aurelio Ranzato
Abstract:
In continual learning (CL), an agent learns from a stream of tasks leveraging prior experience to transfer knowledge to future tasks. It is an ideal framework to decrease the amount of supervision in the existing learning algorithms. But for a successful knowledge transfer, the learner needs to remember how to perform previous tasks. One way to endow the learner the ability to perform tasks seen i…
▽ More
In continual learning (CL), an agent learns from a stream of tasks leveraging prior experience to transfer knowledge to future tasks. It is an ideal framework to decrease the amount of supervision in the existing learning algorithms. But for a successful knowledge transfer, the learner needs to remember how to perform previous tasks. One way to endow the learner the ability to perform tasks seen in the past is to store a small memory, dubbed episodic memory, that stores few examples from previous tasks and then to replay these examples when training for future tasks. In this work, we empirically analyze the effectiveness of a very small episodic memory in a CL setup where each training example is only seen once. Surprisingly, across four rather different supervised learning benchmarks adapted to CL, a very simple baseline, that jointly trains on both examples from the current task as well as examples stored in the episodic memory, significantly outperforms specifically designed CL approaches with and without episodic memory. Interestingly, we find that repetitive training on even tiny memories of past tasks does not harm generalization, on the contrary, it improves it, with gains between 7\% and 17\% when the memory is populated with a single example per class.
△ Less
Submitted 4 June, 2019; v1 submitted 27 February, 2019;
originally announced February 2019.
-
Efficient Lifelong Learning with A-GEM
Authors:
Arslan Chaudhry,
Marc'Aurelio Ranzato,
Marcus Rohrbach,
Mohamed Elhoseiny
Abstract:
In lifelong learning, the learner is presented with a sequence of tasks, incrementally building a data-driven prior which may be leveraged to speed up learning of a new task. In this work, we investigate the efficiency of current lifelong approaches, in terms of sample complexity, computational and memory cost. Towards this end, we first introduce a new and a more realistic evaluation protocol, wh…
▽ More
In lifelong learning, the learner is presented with a sequence of tasks, incrementally building a data-driven prior which may be leveraged to speed up learning of a new task. In this work, we investigate the efficiency of current lifelong approaches, in terms of sample complexity, computational and memory cost. Towards this end, we first introduce a new and a more realistic evaluation protocol, whereby learners observe each example only once and hyper-parameter selection is done on a small and disjoint set of tasks, which is not used for the actual learning experience and evaluation. Second, we introduce a new metric measuring how quickly a learner acquires a new skill. Third, we propose an improved version of GEM (Lopez-Paz & Ranzato, 2017), dubbed Averaged GEM (A-GEM), which enjoys the same or even better performance as GEM, while being almost as computationally and memory efficient as EWC (Kirkpatrick et al., 2016) and other regularization-based methods. Finally, we show that all algorithms including A-GEM can learn even more quickly if they are provided with task descriptors specifying the classification tasks under consideration. Our experiments on several standard lifelong learning benchmarks demonstrate that A-GEM has the best trade-off between accuracy and efficiency.
△ Less
Submitted 9 January, 2019; v1 submitted 2 December, 2018;
originally announced December 2018.
-
GDPP: Learning Diverse Generations Using Determinantal Point Process
Authors:
Mohamed Elfeki,
Camille Couprie,
Morgane Riviere,
Mohamed Elhoseiny
Abstract:
Generative models have proven to be an outstanding tool for representing high-dimensional probability distributions and generating realistic-looking images. An essential characteristic of generative models is their ability to produce multi-modal outputs. However, while training, they are often susceptible to mode collapse, that is models are limited in mapping input noise to only a few modes of th…
▽ More
Generative models have proven to be an outstanding tool for representing high-dimensional probability distributions and generating realistic-looking images. An essential characteristic of generative models is their ability to produce multi-modal outputs. However, while training, they are often susceptible to mode collapse, that is models are limited in mapping input noise to only a few modes of the true data distribution. In this work, we draw inspiration from Determinantal Point Process (DPP) to propose an unsupervised penalty loss that alleviates mode collapse while producing higher quality samples. DPP is an elegant probabilistic measure used to model negative correlations within a subset and hence quantify its diversity. We use DPP kernel to model the diversity in real data as well as in synthetic data. Then, we devise an objective term that encourages generators to synthesize data with similar diversity to real data. In contrast to previous state-of-the-art generative models that tend to use additional trainable parameters or complex training paradigms, our method does not change the original training scheme. Embedded in an adversarial training and variational autoencoder, our Generative DPP approach shows a consistent resistance to mode-collapse on a wide variety of synthetic data and natural image datasets including MNIST, CIFAR10, and CelebA, while outperforming state-of-the-art methods for data-efficiency, generation quality, and convergence-time whereas being 5.8x faster than its closest competitor.
△ Less
Submitted 24 November, 2019; v1 submitted 30 November, 2018;
originally announced December 2018.
-
DeSIGN: Design Inspiration from Generative Networks
Authors:
Othman Sbai,
Mohamed Elhoseiny,
Antoine Bordes,
Yann LeCun,
Camille Couprie
Abstract:
Can an algorithm create original and compelling fashion designs to serve as an inspirational assistant? To help answer this question, we design and investigate different image generation models associated with different loss functions to boost creativity in fashion generation. The dimensions of our explorations include: (i) different Generative Adversarial Networks architectures that start from no…
▽ More
Can an algorithm create original and compelling fashion designs to serve as an inspirational assistant? To help answer this question, we design and investigate different image generation models associated with different loss functions to boost creativity in fashion generation. The dimensions of our explorations include: (i) different Generative Adversarial Networks architectures that start from noise vectors to generate fashion items, (ii) novel loss functions that encourage novelty, inspired from Sharma-Mittal divergence, a generalized mutual information measure for the widely used relative entropies such as Kullback-Leibler, and (iii) a generation process following the key elements of fashion design (disentangling shape and texture components). A key challenge of this study is the evaluation of generated designs and the retrieval of best ones, hence we put together an evaluation protocol associating automatic metrics and human experimental studies that we hope will help ease future research. We show that our proposed creativity criterion yield better overall appreciation than the one employed in Creative Adversarial Networks. In the end, about 61% of our images are thought to be created by human designers rather than by a computer while also being considered original per our human subject experiments, and our proposed loss scores the highest compared to existing losses in both novelty and likability.
△ Less
Submitted 14 September, 2018; v1 submitted 3 April, 2018;
originally announced April 2018.
-
Memory Aware Synapses: Learning what (not) to forget
Authors:
Rahaf Aljundi,
Francesca Babiloni,
Mohamed Elhoseiny,
Marcus Rohrbach,
Tinne Tuytelaars
Abstract:
Humans can learn in a continuous manner. Old rarely utilized knowledge can be overwritten by new incoming information while important, frequently used knowledge is prevented from being erased. In artificial learning systems, lifelong learning so far has focused mainly on accumulating knowledge over tasks and overcoming catastrophic forgetting. In this paper, we argue that, given the limited model…
▽ More
Humans can learn in a continuous manner. Old rarely utilized knowledge can be overwritten by new incoming information while important, frequently used knowledge is prevented from being erased. In artificial learning systems, lifelong learning so far has focused mainly on accumulating knowledge over tasks and overcoming catastrophic forgetting. In this paper, we argue that, given the limited model capacity and the unlimited new information to be learned, knowledge has to be preserved or erased selectively. Inspired by neuroplasticity, we propose a novel approach for lifelong learning, coined Memory Aware Synapses (MAS). It computes the importance of the parameters of a neural network in an unsupervised and online manner. Given a new sample which is fed to the network, MAS accumulates an importance measure for each parameter of the network, based on how sensitive the predicted output function is to a change in this parameter. When learning a new task, changes to important parameters can then be penalized, effectively preventing important knowledge related to previous tasks from being overwritten. Further, we show an interesting connection between a local version of our method and Hebb's rule,which is a model for the learning process in the brain. We test our method on a sequence of object recognition tasks and on the challenging problem of learning an embedding for predicting $<$subject, predicate, object$>$ triplets. We show state-of-the-art performance and, for the first time, the ability to adapt the importance of the parameters based on unlabeled data towards what the network needs (not) to forget, which may vary depending on test conditions.
△ Less
Submitted 5 October, 2018; v1 submitted 27 November, 2017;
originally announced November 2017.
-
Generalized Twin Gaussian Processes using Sharma-Mittal Divergence
Authors:
Mohamed Elhoseiny,
Ahmed Elgammal
Abstract:
There has been a growing interest in mutual information measures due to their wide range of applications in Machine Learning and Computer Vision. In this paper, we present a generalized structured regression framework based on Shama-Mittal divergence, a relative entropy measure, which is introduced to the Machine Learning community in this work. Sharma-Mittal (SM) divergence is a generalized mutua…
▽ More
There has been a growing interest in mutual information measures due to their wide range of applications in Machine Learning and Computer Vision. In this paper, we present a generalized structured regression framework based on Shama-Mittal divergence, a relative entropy measure, which is introduced to the Machine Learning community in this work. Sharma-Mittal (SM) divergence is a generalized mutual information measure for the widely used Rényi, Tsallis, Bhattacharyya, and Kullback-Leibler (KL) relative entropies. Specifically, we study Sharma-Mittal divergence as a cost function in the context of the Twin Gaussian Processes (TGP)~\citep{Bo:2010}, which generalizes over the KL-divergence without computational penalty. We show interesting properties of Sharma-Mittal TGP (SMTGP) through a theoretical analysis, which covers missing insights in the traditional TGP formulation. However, we generalize this theory based on SM-divergence instead of KL-divergence which is a special case. Experimentally, we evaluated the proposed SMTGP framework on several datasets. The results show that SMTGP reaches better predictions than KL-based TGP, since it offers a bigger class of models through its parameters that we learn from the data.
△ Less
Submitted 1 June, 2015; v1 submitted 26 September, 2014;
originally announced September 2014.