Improving LoRA with Variational Learning

Authors: Bai Cong, Nico Daheim, Yuesong Shen, Rio Yokota, Mohammad Emtiyaz Khan, Thomas Möllenhoff

Abstract: Bayesian methods have recently been used to improve LoRA finetuning and, although they improve calibration, their effect on other metrics (such as accuracy) is marginal and can sometimes even be detrimental. Moreover, Bayesian methods also increase computational overheads and require additional tricks for them to work well. Here, we fix these issues by using a recently proposed variational algorithm called IVON. We show that IVON is easy to implement and has similar costs to AdamW, and yet it can also drastically improve many metrics by using a simple posterior pruning technique. We present extensive results on billion-scale LLMs (Llama and Qwen series) going way beyond the scale of existing applications of IVON. For example, we finetune a Llama-3.2-3B model on a set of commonsense reasoning tasks and improve accuracy over AdamW by 1.3% and reduce ECE by 5.4%, outperforming AdamW and other recent Bayesian methods like Laplace-LoRA and BLoB. Overall, our results show that variational learning with IVON can effectively improve LoRA finetuning.

Submitted 17 June, 2025; originally announced June 2025.

Comments: 16 pages, 4 figures

arXiv:2506.14262 [pdf, ps, other]

Knowledge Adaptation as Posterior Correction

Authors: Mohammad Emtiyaz Khan

Abstract: Adaptation is the holy grail of intelligence, but even the best AI models (like GPT) lack the adaptivity of toddlers. So the question remains: how can machines adapt quickly? Despite a lot of progress on model adaptation to facilitate continual and federated learning, as well as model merging, editing, unlearning, etc., little is known about the mechanisms by which machines can naturally learn to adapt in a similar way as humans and animals. Here, we show that all such adaptation methods can be seen as different ways of 'correcting' the approximate posteriors. More accurate posteriors lead to smaller corrections, which in turn imply quicker adaptation. The result is obtained by using a dual perspective of the Bayesian Learning Rule of Khan and Rue (2023) where interference created during adaptation is characterized by the natural-gradient mismatch over the past data. We present many examples to demonstrate the use of posterior correction as a natural mechanism for machines to learn to adapt quickly.

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.13150 [pdf, ps, other]

Federated ADMM from Bayesian Duality

Authors: Thomas Möllenhoff, Siddharth Swaroop, Finale Doshi-Velez, Mohammad Emtiyaz Khan

Abstract: ADMM is a popular method for federated deep learning which originated in the 1970s and, even though many new variants of it have been proposed since then, its core algorithmic structure has remained unchanged. Here, we take a major departure from the old structure and present a fundamentally new way to derive and extend federated ADMM. We propose to use a structure called Bayesian Duality which exploits a duality of the posterior distributions obtained by solving a variational-Bayesian reformulation of the original problem. We show that this naturally recovers the original ADMM when isotropic Gaussian posteriors are used, and yields non-trivial extensions for other posterior forms. For instance, full-covariance Gaussians lead to Newton-like variants of ADMM, while diagonal covariances result in a cheap Adam-like variant. This is especially useful to handle heterogeneity in federated deep learning, giving up to 7% accuracy improvements over recent baselines. Our work opens a new Bayesian path to improve primal-dual methods.

Submitted 16 June, 2025; originally announced June 2025.

Comments: Code is at https://github.com/team-approx-bayes/bayes-admm

arXiv:2506.12903 [pdf, ps, other]

Variational Learning Finds Flatter Solutions at the Edge of Stability

Authors: Avrajit Ghosh, Bai Cong, Rio Yokota, Saiprasad Ravishankar, Rongrong Wang, Molei Tao, Mohammad Emtiyaz Khan, Thomas Möllenhoff

Abstract: Variational Learning (VL) has recently gained popularity for training deep neural networks and is competitive with standard learning methods. Part of its empirical success can be explained by theories such as PAC-Bayes bounds, minimum description length and marginal likelihood, but there are few tools to unravel the implicit regularization in play. Here, we analyze the implicit regularization of VL through the Edge of Stability (EoS) framework. EoS has previously been used to show that gradient descent can find flat solutions and we extend this result to VL to show that it can find even flatter solutions. This is obtained by controlling the posterior covariance and the number of Monte Carlo samples from the posterior. These results are derived in a similar fashion as the standard EoS literature for deep learning, by first deriving a result for a quadratic problem and then extending it to deep neural networks. We empirically validate these findings on a wide variety of large networks, such as ResNet and ViT, to find that the theoretical results closely match the empirical ones. Ours is the first work to analyze the EoS dynamics in VL.

Submitted 15 June, 2025; originally announced June 2025.

arXiv:2501.17325 [pdf, other]

Connecting Federated ADMM to Bayes

Authors: Siddharth Swaroop, Mohammad Emtiyaz Khan, Finale Doshi-Velez

Abstract: We provide new connections between two distinct federated learning approaches based on (i) ADMM and (ii) Variational Bayes (VB), and propose new variants by combining their complementary strengths. Specifically, we show that the dual variables in ADMM naturally emerge through the 'site' parameters used in VB with isotropic Gaussian covariances. Using this, we derive two versions of ADMM from VB that use flexible covariances and functional regularisation, respectively. Through numerical experiments, we validate the improvements obtained in performance. The work shows a connection between two fields that are believed to be fundamentally different and combines them to improve federated learning.

Submitted 28 February, 2025; v1 submitted 28 January, 2025; originally announced January 2025.

arXiv:2501.16988 [pdf, other]

Marginal and Conditional Importance Measures from Machine Learning Models and Their Relationship with Conditional Average Treatment Effect

Authors: Mohammad Kaviul Anam Khan, Olli Saarela, Rafal Kustra

Abstract: Interpreting black-box machine learning models is challenging due to their strong dependence on data and inherently non-parametric nature. This paper reintroduces the concept of importance through the "Marginal Variable Importance Metric" (MVIM), a model-agnostic measure of predictor importance based on the true conditional expectation function. MVIM evaluates predictors' influence on continuous or discrete outcomes. A permutation-based estimation approach, inspired by Breiman (2001) and Fisher et al. (2019), is proposed to estimate MVIM. The MVIM estimator is biased when predictors are highly correlated, as black-box models struggle to extrapolate in low-probability regions. To address this, we investigated the bias-variance decomposition of MVIM to understand the source and pattern of the bias under high correlation. A Conditional Variable Importance Metric (CVIM), adapted from Strobl et al. (2008), is introduced to reduce this bias. Both MVIM and CVIM exhibit a quadratic relationship with the conditional average treatment effect (CATE).

Submitted 28 January, 2025; v1 submitted 28 January, 2025; originally announced January 2025.

arXiv:2501.04667 [pdf, other]

Natural Variational Annealing for Multimodal Optimization

Authors: Tâm Le Minh, Julyan Arbel, Thomas Möllenhoff, Mohammad Emtiyaz Khan, Florence Forbes

Abstract: We introduce a new multimodal optimization approach called Natural Variational Annealing (NVA) that combines the strengths of three foundational concepts to simultaneously search for multiple global and local modes of black-box nonconvex objectives. First, it implements a simultaneous search by using variational posteriors, such as mixtures of Gaussians. Second, it applies annealing to gradually trade off exploration for exploitation. Finally, it learns the variational search distribution using natural-gradient learning where updates resemble well-known and easy-to-implement algorithms. The three concepts come together in NVA giving rise to new algorithms and also allowing us to incorporate "fitness shaping", a core concept from evolutionary algorithms. We assess the quality of search on simulations and compare them to methods using gradient descent and evolution strategies. We also provide an application to a real-world inverse problem in planetary science.

Submitted 11 February, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

arXiv:2412.08147 [pdf, other]

How to Weight Multitask Finetuning? Fast Previews via Bayesian Model-Merging

Authors: Hugo Monzón Maldonado, Thomas Möllenhoff, Nico Daheim, Iryna Gurevych, Mohammad Emtiyaz Khan

Abstract: When finetuning on multiple tasks together, it is important to carefully weight them to get good performance, but searching for good weights can be difficult and costly. Here, we propose to aid the search with fast previews to quickly get a rough idea of different reweighting options. We use model merging to create previews by simply reusing and averaging parameters of models trained on each task separately (no retraining required). To improve the quality of previews, we propose a Bayesian approach to design new merging strategies by using more flexible posteriors. We validate our findings on vision and natural-language transformers. Our work shows the benefits of model merging via Bayes to improve multitask finetuning.

Submitted 11 December, 2024; originally announced December 2024.

arXiv:2411.04421 [pdf, other]

Variational Low-Rank Adaptation Using IVON

Authors: Bai Cong, Nico Daheim, Yuesong Shen, Daniel Cremers, Rio Yokota, Mohammad Emtiyaz Khan, Thomas Möllenhoff

Abstract: We show that variational learning can significantly improve the accuracy and calibration of Low-Rank Adaptation (LoRA) without a substantial increase in the cost. We replace AdamW by the Improved Variational Online Newton (IVON) algorithm to finetune large language models. For Llama-2 with 7 billion parameters, IVON improves the accuracy over AdamW by 2.8% and expected calibration error by 4.6%. The accuracy is also better than the other Bayesian alternatives, yet the cost is lower and the implementation is easier. Our work provides additional evidence for the effectiveness of IVON for large language models. The code is available at https://github.com/team-approx-bayes/ivon-lora.

Submitted 9 November, 2024; v1 submitted 6 November, 2024; originally announced November 2024.

Comments: Published at 38th Workshop on Fine-Tuning in Machine Learning (NeurIPS 2024). Code available at https://github.com/team-approx-bayes/ivon-lora. In version 2 we fixed a typo in the equation of prior in section 2

arXiv:2404.08168 [pdf, other]

Conformal Prediction via Regression-as-Classification

Authors: Etash Guha, Shlok Natarajan, Thomas Möllenhoff, Mohammad Emtiyaz Khan, Eugene Ndiaye

Abstract: Conformal prediction (CP) for regression can be challenging, especially when the output distribution is heteroscedastic, multimodal, or skewed. Some of the issues can be addressed by estimating a distribution over the output, but in reality, such approaches can be sensitive to estimation error and yield unstable intervals. Here, we circumvent the challenges by converting regression to a classification problem and then use CP for classification to obtain CP sets for regression. To preserve the ordering of the continuous-output space, we design a new loss function and make necessary modifications to the CP classification techniques. Empirical results on many benchmarks show that this simple approach gives surprisingly good results on many practical problems.

Submitted 11 April, 2024; originally announced April 2024.

Comments: International Conference on Learning Representations 2024

Journal ref: International Conference on Learning Representations 2024

arXiv:2402.17641 [pdf, other]

Variational Learning is Effective for Large Deep Networks

Authors: Yuesong Shen, Nico Daheim, Bai Cong, Peter Nickl, Gian Maria Marconi, Clement Bazan, Rio Yokota, Iryna Gurevych, Daniel Cremers, Mohammad Emtiyaz Khan, Thomas Möllenhoff

Abstract: We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON's computational costs are nearly identical to Adam but its predictive uncertainty is better. We show several new use cases of IVON where we improve finetuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence that variational learning is effective.

Submitted 6 June, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: Published at International Conference on Machine Learning (ICML), 2024. The first two authors contributed equally. Code is available here: https://github.com/team-approx-bayes/ivon

arXiv:2402.00809 [pdf, other]

Position: Bayesian Deep Learning is Needed in the Age of Large-Scale AI

Authors: Theodore Papamarkou, Maria Skoularidou, Konstantina Palla, Laurence Aitchison, Julyan Arbel, David Dunson, Maurizio Filippone, Vincent Fortuin, Philipp Hennig, José Miguel Hernández-Lobato, Aliaksandr Hubin, Alexander Immer, Theofanis Karaletsos, Mohammad Emtiyaz Khan, Agustinus Kristiadi, Yingzhen Li, Stephan Mandt, Christopher Nemeth, Michael A. Osborne, Tim G. J. Rudner, David Rügamer, Yee Whye Teh, Max Welling, Andrew Gordon Wilson, Ruqi Zhang

Abstract: In the current landscape of deep learning research, there is a predominant emphasis on achieving high predictive accuracy in supervised tasks involving large image and language datasets. However, a broader perspective reveals a multitude of overlooked metrics, tasks, and data types, such as uncertainty, active and continual learning, and scientific data, that demand attention. Bayesian deep learning (BDL) constitutes a promising avenue, offering advantages across these diverse settings. This paper posits that BDL can elevate the capabilities of deep learning. It revisits the strengths of BDL, acknowledges existing challenges, and highlights some exciting research avenues aimed at addressing these obstacles. Looking ahead, the discussion focuses on possible ways to combine large-scale foundation models with BDL to unlock their full potential.

Submitted 6 August, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024

arXiv:2401.06261 [pdf, other]

Prediction of causal genes at GWAS loci with pleiotropic gene regulatory effects using sets of correlated instrumental variables

Authors: Mariyam Khan, Adriaan-Alexander Ludl, Sean Bankier, Johan Bjorkegren, Tom Michoel

Abstract: Multivariate Mendelian randomization (MVMR) is a statistical technique that uses sets of genetic instruments to estimate the direct causal effects of multiple exposures on an outcome of interest. At genomic loci with pleiotropic gene regulatory effects, that is, loci where the same genetic variants are associated with multiple nearby genes, MVMR can potentially be used to predict candidate causal genes. However, consensus in the field dictates that the genetic instruments in MVMR must be independent, which is usually not possible when considering a group of candidate genes from the same locus. We used causal inference theory to show that MVMR with correlated instruments satisfies the instrumental set condition. This is a classical result by Brito and Pearl (2002) for structural equation models that guarantees the identifiability of causal effects in situations where multiple exposures collectively, but not individually, separate a set of instrumental variables from an outcome variable. Extensive simulations confirmed the validity and usefulness of these theoretical results even at modest sample sizes. Importantly, the causal effect estimates remain unbiased and their variance small when instruments are highly correlated. We applied MVMR with correlated instrumental variable sets at risk loci from genome-wide association studies (GWAS) for coronary artery disease using eQTL data from the STARNET study. Our method predicts causal genes at twelve loci, each associated with multiple colocated genes in multiple tissues. However, the extensive degree of regulatory pleiotropy across tissues and the limited number of causal variants in each locus still require that MVMR is run on a tissue-by-tissue basis, and testing all gene-tissue pairs at a given locus in a single model to predict causal gene-tissue combinations remains infeasible.

Submitted 20 September, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: Revised version, 31 pages, 5 figures. "TeX Source" contains file SI.pdf with Supplementary Information (26 pages, 9 figures). Code available at https://github.com/mariyam-khan/Causal_genes_GWAS_loci_CAD . Supporting data available at https://dataverse.no/dataset.xhtml?persistentId=doi:10.18710/VM0WKQ

arXiv:2311.01669 [pdf]

Motor vehicles accidents and teenage drivers: A statistical analysis of their age and injuries

Authors: Debo Brata Paul Argha, Md Javed Imtiaze Khan

Abstract: Motorcycle accidents are a prevalent problem in Texas, resulting in hundreds of injuries and deaths each year. Motorcycles provide the driver with little physical protection during accidents compared to cars and other vehicles, so when there is a collision involving a motorcycle, the motorcyclist is likely to be injured. While there are numerous reasons for motorcycle accidents, most are caused by negligence and could have been avoided. Because of the increasing popularity of motorcycles and scooters in Texas, coupled with an increase in the number of motorcycle accidents, the Texas Department of Transportation (TxDOT) has stepped up its efforts to improve motorcycle safety. The data show that teenage drivers are the most vulnerable to motorcycle accidents. In this report, we estimate the probability of injury for young motorcycle drivers and passengers under different conditions and predict how the injury rate for this group will change in the coming years.

Submitted 2 November, 2023; originally announced November 2023.

Comments: 10 pages

arXiv:2310.19273 [pdf, other]

The Memory Perturbation Equation: Understanding Model's Sensitivity to Data

Authors: Peter Nickl, Lu Xu, Dharmesh Tailor, Thomas Möllenhoff, Mohammad Emtiyaz Khan

Abstract: Understanding a model's sensitivity to its training data is crucial but can also be challenging and costly, especially during training. To simplify such issues, we present the Memory-Perturbation Equation (MPE) which relates a model's sensitivity to perturbation in its training data. Derived using Bayesian principles, the MPE unifies existing sensitivity measures, generalizes them to a wide variety of models and algorithms, and unravels useful properties regarding sensitivities. Our empirical results show that sensitivity estimates obtained during training can be used to faithfully predict generalization on unseen test data. The proposed equation is expected to be useful for future research on robust and adaptive learning.

Submitted 16 January, 2024; v1 submitted 30 October, 2023; originally announced October 2023.

Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

arXiv:2310.14348 [pdf, other]

DePAint: A Decentralized Safe Multi-Agent Reinforcement Learning Algorithm considering Peak and Average Constraints

Authors: Raheeb Hassan, K. M. Shadman Wadith, Md. Mamun or Rashid, Md. Mosaddek Khan

Abstract: The domain of safe multi-agent reinforcement learning (MARL), despite its potential applications in areas ranging from drone delivery and vehicle automation to the development of zero-energy communities, remains relatively unexplored. The primary challenge involves training agents to learn optimal policies that maximize rewards while adhering to stringent safety constraints, all without the oversight of a central controller. These constraints are critical in a wide array of applications. Moreover, ensuring the privacy of sensitive information in decentralized settings introduces an additional layer of complexity, necessitating innovative solutions that uphold privacy while achieving the system's safety and efficiency goals. In this paper, we address the problem of multi-agent policy optimization in a decentralized setting, where agents communicate with their neighbors to maximize the sum of their cumulative rewards while also satisfying each agent's safety constraints. We consider both peak and average constraints. In this scenario, there is no central controller coordinating the agents and both the rewards and constraints are only known to each agent locally/privately. We formulate the problem as a decentralized constrained multi-agent Markov Decision Problem and propose a momentum-based decentralized policy gradient method, DePAint, to solve it. To the best of our knowledge, this is the first privacy-preserving fully decentralized multi-agent reinforcement learning algorithm that considers both peak and average constraints. We then provide theoretical analysis and empirical evaluation of our algorithm in a number of scenarios and compare its performance to centralized algorithms that consider similar constraints.

Submitted 3 April, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

Comments: accepted for publication in Springer Applied Intelligence Journal

arXiv:2310.10553 [pdf, other]

TacticAI: an AI assistant for football tactics

Authors: Zhe Wang, Petar Veličković, Daniel Hennes, Nenad Tomašev, Laurel Prince, Michael Kaisers, Yoram Bachrach, Romuald Elie, Li Kevin Wenliang, Federico Piccinini, William Spearman, Ian Graham, Jerome Connor, Yi Yang, Adrià Recasens, Mina Khan, Nathalie Beauguerlange, Pablo Sprechmann, Pol Moreno, Nicolas Heess, Michael Bowling, Demis Hassabis, Karl Tuyls

Abstract: Identifying key patterns of tactics implemented by rival teams, and developing effective responses, lies at the heart of modern football. However, doing so algorithmically remains an open research challenge. To address this unmet need, we propose TacticAI, an AI football tactics assistant developed and evaluated in close collaboration with domain experts from Liverpool FC. We focus on analysing corner kicks, as they offer coaches the most direct opportunities for interventions and improvements. TacticAI incorporates both a predictive and a generative component, allowing the coaches to effectively sample and explore alternative player setups for each corner kick routine and to select those with the highest predicted likelihood of success. We validate TacticAI on a number of relevant benchmark tasks: predicting receivers and shot attempts and recommending player position adjustments. The utility of TacticAI is validated by a qualitative study conducted with football domain experts at Liverpool FC. We show that TacticAI's model suggestions are not only indistinguishable from real tactics, but also favoured over existing tactics 90% of the time, and that TacticAI offers an effective corner kick retrieval system. TacticAI achieves these results despite the limited availability of gold-standard data, achieving data efficiency through geometric deep learning.

Submitted 17 October, 2023; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: 32 pages, 10 figures

arXiv:2306.15169 [pdf, other]

Exploiting Inferential Structure in Neural Processes

Authors: Dharmesh Tailor, Mohammad Emtiyaz Khan, Eric Nalisnick

Abstract: Neural Processes (NPs) are appealing due to their ability to perform fast adaptation based on a context set. This set is encoded by a latent variable, which is often assumed to follow a simple distribution. However, in real-world settings, the context set may be drawn from richer distributions having multiple modes, heavy tails, etc. In this work, we provide a framework that allows NPs' latent variable to be given a rich prior defined by a graphical model. These distributional assumptions directly translate into an appropriate aggregation strategy for the context set. Moreover, we describe a message-passing procedure that still allows for end-to-end optimization with stochastic gradients. We demonstrate the generality of our framework by using mixture and Student-t assumptions that yield improvements in function modelling and test-time robustness.

Submitted 26 June, 2023; originally announced June 2023.

Comments: Uncertainty in Artificial Intelligence (UAI) 2023

arXiv:2306.03566 [pdf, other]

Memory-Based Dual Gaussian Processes for Sequential Learning

Authors: Paul E. Chang, Prakhar Verma, S. T. John, Arno Solin, Mohammad Emtiyaz Khan

Abstract: Sequential learning with Gaussian processes (GPs) is challenging when access to past data is limited, for example, in continual and active learning. In such cases, errors can accumulate over time due to inaccuracies in the posterior, hyperparameters, and inducing points, making accurate learning challenging. Here, we present a method to keep all such errors in check using the recently proposed dual sparse variational GP. Our method enables accurate inference for generic likelihoods and improves learning by actively building and updating a memory of past data. We demonstrate its effectiveness in several applications involving Bayesian optimization, active learning, and continual learning.

Submitted 6 June, 2023; originally announced June 2023.

Comments: International Conference on Machine Learning (ICML) 2023

arXiv:2304.14251 [pdf, ps, other]

Variational Bayes Made Easy

Authors: Mohammad Emtiyaz Khan

Abstract: Variational Bayes is a popular method for approximate inference but its derivation can be cumbersome. To simplify the process, we give a 3-step recipe to identify the posterior form by explicitly looking for linearity with respect to expectations of well-known distributions. We can then directly write the update by simply "reading-off" the terms in front of those expectations. The recipe makes the derivation easier, faster, shorter, and more general.

Submitted 10 July, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

Journal ref: Presented at the 5th Symposium on Advances in Approximate Bayesian Inference (AABI 2023)

arXiv:2303.12210 [pdf, ps, other]

A Random Projection k Nearest Neighbours Ensemble for Classification via Extended Neighbourhood Rule

Authors: Amjad Ali, Muhammad Hamraz, Dost Muhammad Khan, Wajdan Deebani, Zardad Khan

Abstract: Ensembles based on k nearest neighbours (kNN) combine a large number of base learners, each constructed on a sample taken from a given training data. Typical kNN based ensembles determine the k closest observations in the training data bounded to a test sample point by a spherical region to predict its class. In this paper, a novel random projection extended neighbourhood rule (RPExNRule) ensemble is proposed where bootstrap samples from the given training data are randomly projected into lower dimensions for additional randomness in the base models and to preserve features information. It uses the extended neighbourhood rule (ExNRule) to fit kNN as base learners on randomly projected bootstrap samples.

Submitted 21 March, 2023; originally announced March 2023.

Comments: 23 pages, 8 diagrams, 69 references

ACM Class: F.2.2

arXiv:2303.04397 [pdf, other]

The Lie-Group Bayesian Learning Rule

Authors: Eren Mehmet Kıral, Thomas Möllenhoff, Mohammad Emtiyaz Khan

Abstract: The Bayesian Learning Rule provides a framework for generic algorithm design but can be difficult to use for three reasons. First, it requires a specific parameterization of exponential family. Second, it uses gradients which can be difficult to compute. Third, its update may not always stay on the manifold. We address these difficulties by proposing an extension based on Lie-groups where posteriors are parametrized through transformations of an arbitrary base distribution and updated via the group's exponential map. This simplifies all three difficulties for many cases, providing flexible parametrizations through group's action, simple gradient computation through reparameterization, and updates that always stay on the manifold. We use the new learning rule to derive a new algorithm for deep learning with desirable biologically-plausible attributes to learn sparse features. Our work opens a new frontier for the design of new algorithms by exploiting Lie-group structures.

Submitted 8 March, 2023; originally announced March 2023.

Comments: AISTATS 2023

arXiv:2303.01954 [pdf, other]

Synthetic Data Generator for Adaptive Interventions in Global Health

Authors: Aditya Rastogi, Juan Francisco Garamendi, Ana Fernández del Río, Anna Guitart, Moiz Hassan Khan, Dexian Tang, África Periáñez

Abstract: Artificial Intelligence and digital health have the potential to transform global health. However, having access to representative data to test and validate algorithms in realistic production environments is essential. We introduce HealthSyn, an open-source synthetic data generator of user behavior for testing reinforcement learning algorithms in the context of mobile health interventions. The generator utilizes Markov processes to generate diverse user actions, with individual user behavioral patterns that can change in reaction to personalized interventions (i.e., reminders, recommendations, and incentives). These actions are translated into actual logs using an ML-purposed data schema specific to the mobile health application functionality included with HealthKit, an open-source SDK. The logs can be fed to pipelines to obtain user metrics. The generated data, which is based on real-world behaviors and simulation techniques, can be used to develop, test, and evaluate both ML algorithms in research and end-to-end operational RL-based intervention delivery frameworks.

Submitted 27 April, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

arXiv:2302.09738 [pdf, other]

Simplifying Momentum-based Positive-definite Submanifold Optimization with Applications to Deep Learning

Authors: Wu Lin, Valentin Duruisseaux, Melvin Leok, Frank Nielsen, Mohammad Emtiyaz Khan, Mark Schmidt

Abstract: Riemannian submanifold optimization with momentum is computationally challenging because, to ensure that the iterates remain on the submanifold, we often need to solve difficult differential equations. Here, we simplify such difficulties for a class of sparse or structured symmetric positive-definite matrices with the affine-invariant metric. We do so by proposing a generalized version of the Riemannian normal coordinates that dynamically orthonormalizes the metric and locally converts the problem into an unconstrained problem in the Euclidean space. We use our approach to simplify existing approaches for structured covariances and develop matrix-inverse-free $2^\text{nd}$-order optimizers for deep learning with low precision by using only matrix multiplications. Code: https://github.com/yorkerlin/StructuredNGD-DL

Submitted 16 March, 2024; v1 submitted 19 February, 2023; originally announced February 2023.

Comments: A long version of the ICML 2023 paper. Updated the main text to emphasize challenges of using existing Riemannian methods to estimate sparse and structured SPD matrices

arXiv:2212.09931 [pdf, other]

A Generalized Variable Importance Metric and Estimator for Black Box Machine Learning Models

Authors: Mohammad Kaviul Anam Khan, Olli Saarela, Rafal Kustra

Abstract: In this paper we define a population parameter, "Generalized Variable Importance Metric (GVIM)", to measure the importance of predictors for black box machine learning methods, where the importance is not represented by a model-based parameter. GVIM is defined for each input variable, using the true conditional expectation function, and it measures the variable's importance in affecting a continuous or a binary response. We extend previously published results to show that the defined GVIM can be represented as a function of the Conditional Average Treatment Effect (CATE) for any kind of predictor, which gives it a causal interpretation and further justification as an alternative to classical measures of significance that are only available in simple parametric models. An extensive set of simulations, using realistically complex relationships between covariates and outcomes and a number of regression techniques of varying degrees of complexity, shows the performance of our proposed estimator of the GVIM.

Submitted 23 December, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

arXiv:2211.11278 [pdf, ps, other]

Optimal Extended Neighbourhood Rule $k$ Nearest Neighbours Ensemble

Authors: Amjad Ali, Zardad Khan, Dost Muhammad Khan, Saeed Aldahmani

Abstract: The traditional k nearest neighbor (kNN) approach uses a distance formula within a spherical region to determine the k closest training observations to a test sample point. However, this approach may not work well when the test point is located outside this region. Moreover, aggregating many base kNN learners can result in poor ensemble performance due to high classification errors. To address these issues, a new optimal extended neighborhood rule based ensemble method is proposed in this paper. This rule determines neighbors in k steps starting from the closest sample point to the unseen observation and selecting subsequent nearest data points until the required number of observations is reached. Each base model is constructed on a bootstrap sample with a random subset of features, and optimal models are selected based on out-of-bag performance after building a sufficient number of models. The proposed ensemble is compared with state-of-the-art methods on 17 benchmark datasets using accuracy, Cohen's kappa, and Brier score (BS). The performance of the proposed method is also assessed by adding contrived features in the original data.

Submitted 15 February, 2024; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: This manuscript has been submitted for publication in the esteemed journal Pattern Recognition Letters

MSC Class: 14J60
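
Illustrative sketch (not part of the listing, and not the authors' code): the abstract above describes a rule that picks neighbours in k steps, starting from the training point closest to the unseen observation. One possible reading is a chain, where each step selects the not-yet-chosen training point nearest to the previously selected point, followed by a majority vote over the chain. That reading, the majority vote, and the name `exnrule_predict` are assumptions made here for illustration only; the bootstrap, feature-subset, and out-of-bag selection steps of the ensemble are omitted.

```python
import math
from collections import Counter


def exnrule_predict(train_points, train_labels, query, k):
    """Classify `query` by a chain of k neighbours (assumed reading of the rule).

    Step 1 picks the training point nearest to `query`; every later step
    picks the unused training point nearest to the point chosen just before.
    Returns the majority label over the chain and the chain's indices.
    """
    remaining = list(range(len(train_points)))
    current = query
    chain = []
    for _ in range(min(k, len(remaining))):
        # Nearest not-yet-selected training point to the current chain end.
        nearest = min(remaining, key=lambda i: math.dist(current, train_points[i]))
        chain.append(nearest)
        remaining.remove(nearest)
        current = train_points[nearest]
    # Majority vote over the labels collected along the chain (assumption).
    votes = Counter(train_labels[i] for i in chain)
    return votes.most_common(1)[0][0], chain
```

Unlike plain kNN, which measures every distance from the query itself, this chain can walk away from the query, which is one way to read the abstract's concern about test points lying outside the usual spherical region.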

arXiv:2210.01620 [pdf, other]

SAM as an Optimal Relaxation of Bayes

Authors: Thomas Möllenhoff, Mohammad Emtiyaz Khan

Abstract: Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.

Submitted 10 December, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

Comments: Accepted at ICLR 2023. Changes: Link to source code (https://github.com/team-approx-bayes/bayesian-sam), fix a typo in Appendix D

arXiv:2208.04998 [pdf, ps, other]

Towards Enabling Next Generation Societal Virtual Reality Applications for Virtual Human Teleportation

Authors: Jacob Chakareski, Mahmudur Khan, Murat Yuksel

Abstract: Virtual reality (VR) is an emerging technology of great societal potential. Some of its most exciting and promising use cases include remote scene content and untethered lifelike navigation. This article first highlights the relevance of such future societal applications and the challenges ahead towards enabling them. It then provides a broad and contextual high-level perspective of several emerging technologies and unconventional techniques and argues that only by their synergistic integration can the fundamental performance bottlenecks of hyper-intensive computation, ultra-high data rate, and ultra-low latency be overcome to enable untethered and lifelike VR-based remote scene immersion. A novel future system concept is introduced that embodies this holistic integration, unified with a rigorous analysis, to capture the fundamental synergies and interplay between communications, computation, and signal scalability that arise in this context, and advance its performance at the same time. Several representative results highlighting these trade-offs and the benefits of the envisioned system are presented at the end.

Submitted 9 August, 2022; originally announced August 2022.

Comments: This is an extended version (with more details) of a tutorial feature article that will appear in the IEEE Signal Processing Magazine in September 2022

arXiv:2206.05764 [pdf, other]

Mining Multi-Label Samples from Single Positive Labels

Authors: Youngin Cho, Daejin Kim, Mohammad Azam Khan, Jaegul Choo

Abstract: Conditional generative adversarial networks (cGANs) have shown superior results in class-conditional generation tasks. To simultaneously control multiple conditions, cGANs require multi-label training datasets, where multiple labels can be assigned to each data instance. Nevertheless, the tremendous annotation cost limits the accessibility of multi-label datasets in real-world scenarios. Therefore, in this study we explore the practical setting called the single positive setting, where each data instance is annotated by only one positive label with no explicit negative labels. To generate multi-label data in the single positive setting, we propose a novel sampling approach called single-to-multi-label (S2M) sampling, based on the Markov chain Monte Carlo method. As a widely applicable "add-on" method, our proposed S2M sampling method enables existing unconditional and conditional GANs to draw high-quality multi-label data with a minimal annotation cost. Extensive experiments on real image datasets verify the effectiveness and correctness of our method, even when compared to a model trained with fully annotated datasets.

Submitted 28 May, 2023; v1 submitted 12 June, 2022; originally announced June 2022.

Comments: NeurIPS 2022

arXiv:2202.08146 [pdf, other]

A Prospective Approach for Human-to-Human Interaction Recognition from Wi-Fi Channel Data using Attention Bidirectional Gated Recurrent Neural Network with GUI Application Implementation

Authors: Md. Mohi Uddin Khan, Abdullah Bin Shams, Md. Mohsin Sarker Raihan

Abstract: Human Activity Recognition (HAR) research has gained significant momentum due to recent technological advancements, artificial intelligence algorithms, the need for smart cities, and socioeconomic transformation. However, existing computer vision and sensor-based HAR solutions have limitations such as privacy issues, memory and power consumption, and discomfort in wearing sensors for which researchers are observing a paradigm shift in HAR research. In response, WiFi-based HAR is gaining popularity due to the availability of more coarse-grained Channel State Information. However, existing WiFi-based HAR approaches are limited to classifying independent and non-concurrent human activities performed within equal time duration. Recent research commonly utilizes a Single Input Multiple Output communication link with a WiFi signal of 5 GHz channel frequency, using two WiFi routers or two Intel 5300 NICs as transmitter-receiver. Our study, on the other hand, utilizes a Multiple Input Multiple Output radio link between a WiFi router and an Intel 5300 NIC, with the time-series Wi-Fi channel state information based on 2.4 GHz channel frequency for mutual human-to-human concurrent interaction recognition. The proposed Self-Attention guided Bidirectional Gated Recurrent Neural Network (Attention-BiGRU) deep learning model can classify 13 mutual interactions with a maximum benchmark accuracy of 94% for a single subject-pair. This has been expanded for ten subject pairs, which secured a benchmark accuracy of 88% with improved classification around the interaction-transition region. An executable graphical user interface (GUI) software has also been developed in this study using the PyQt5 python module to classify, save, and display the overall mutual concurrent human interactions performed within a given time duration. ...

Submitted 9 May, 2023; v1 submitted 16 February, 2022; originally announced February 2022.

Comments: 48 Pages. This is the pre-print version article submitted for peer-review to a prestigious journal

arXiv:2112.08211 [pdf, other]

TrialGraph: Machine Intelligence Enabled Insight from Graph Modelling of Clinical Trials

Authors: Christopher Yacoumatos, Stefano Bragaglia, Anshul Kanakia, Nils Svangård, Jonathan Mangion, Claire Donoghue, Jim Weatherall, Faisal M. Khan, Khader Shameer

Abstract: A major impediment to successful drug development is the complexity, cost, and scale of clinical trials. The detailed internal structure of clinical trial data can make conventional optimization difficult to achieve. Recent advances in machine learning, specifically graph-structured data analysis, have the potential to enable significant progress in improving the clinical trial design. TrialGraph seeks to apply these methodologies to produce a proof-of-concept framework for developing models which can aid drug development and benefit patients. In this work, we first introduce a curated clinical trial data set compiled from the CT.gov, AACT and TrialTrove databases (n=1191 trials; representing one million patients) and describe the conversion of this data to graph-structured formats. We then detail the mathematical basis and implementation of a selection of graph machine learning algorithms, which typically use standard machine classifiers on graph data embedded in a low-dimensional feature space. We trained these models to predict side effect information for a clinical trial given information on the disease, existing medical conditions, and treatment. The MetaPath2Vec algorithm performed exceptionally well, with standard Logistic Regression, Decision Tree, Random Forest, Support Vector, and Neural Network classifiers exhibiting typical ROC-AUC scores of 0.85, 0.68, 0.86, 0.80, and 0.77, respectively. Remarkably, the best performing classifiers could only produce typical ROC-AUC scores of 0.70 when trained on equivalent array-structured data. Our work demonstrates that graph modelling can significantly improve prediction accuracy on appropriate datasets. Successive versions of the project that refine modelling assumptions and incorporate more data types can produce excellent predictors with real-world applications in drug development.

Submitted 15 December, 2021; originally announced December 2021.

Comments: 17 pages (Manuscript); 3 pages (Supplemental Data); 9 figures

MSC Class: 68Q04; 05Cxx ACM Class: J.3.1; I.2.0; I.5.1; I.7; H.3

arXiv:2111.04892 [pdf]

Determinants of Women's Attitude towards Intimate Partner Violence: Evidence from Bangladesh

Authors: Md Tareq Ferdous Khan, Lianfen Qian

Abstract: Purpose: The purpose of this study is to identify the important determinants responsible for the variation in women's attitude towards intimate partner violence (IPV). Methods: A nationally representative Bangladesh Demographic and Health Survey 2014 data of 17,863 women is used to address the research questions. In the study, two response variables are constructed from the five attitude questions, and a series of individual and community-level predictors are tested. The preliminary statistical methods employed in the study include univariate and bivariate distributions, while the adopted statistical models include binary logistic, ordinal logistic, mixed-effects multilevel logistic models for each response variable, and finally, the generalized ordinal logistic regression. Results: Statistical analyses reveal that among the individual-level independent variables age at first marriage, respondent's education, decision score, religion, NGO membership, access to information, husband's education, normalized wealth score, and division indicator have significant effects on the women's attitude towards IPV. Among the three community-level variables, only the mean decision score is found significant in lowering the likelihood. Conclusions: It is evident that other than religion, NGO membership, and division indicator, the higher the value of the variable, the lower the likelihood of justifying IPV. However, being a Muslim, NGO member, and resident of other divisions, women are found more tolerant of IPV from their respective counterparts. These findings suggest the government, policymakers, practitioners, academicians, and all other stakeholders to work on the significant determinants to divert women's wrong attitude towards IPV, and thus help to take away this deep-rooted problem from society.

Submitted 8 November, 2021; originally announced November 2021.

Comments: 22 pages, 6 tables, 2 figures

arXiv:2111.03412 [pdf, other]

Dual Parameterization of Sparse Variational Gaussian Processes

Authors: Vincent Adam, Paul E. Chang, Mohammad Emtiyaz Khan, Arno Solin

Abstract: Sparse variational Gaussian process (SVGP) methods are a common choice for non-conjugate Gaussian process inference because of their computational benefits. In this paper, we improve their computational efficiency by using a dual parameterization where each data example is assigned dual parameters, similarly to site parameters used in expectation propagation. Our dual parameterization speeds up inference using natural gradient descent, and provides a tighter evidence lower bound for hyperparameter learning. The approach has the same memory cost as the current SVGP methods, but it is faster and more accurate.

Submitted 19 January, 2022; v1 submitted 5 November, 2021; originally announced November 2021.

Comments: Advances in Neural Information Processing Systems (NeurIPS 2021)

arXiv:2109.05529 [pdf]

Estimating a new panel MSK dataset for comparative analyses of national absorptive capacity systems, economic growth, and development in low and middle income economies

Authors: Muhammad Salar Khan

Abstract: Within the national innovation system literature, empirical analyses are severely lacking for developing economies. Particularly, the low- and middle-income countries (LMICs) eligible for the World Bank's International Development Association (IDA) support, are rarely part of any empirical discourse on growth, development, and innovation. One major issue hindering panel analyses in LMICs, and thus them being subject to any empirical discussion, is the lack of complete data availability. This work offers a new complete panel dataset with no missing values for LMICs eligible for IDA's support. I use a standard, widely respected multiple imputation technique (specifically, Predictive Mean Matching) developed by Rubin (1987). This technique respects the structure of multivariate continuous panel data at the country level. I employ this technique to create a large dataset consisting of many variables drawn from publicly available established sources. These variables, in turn, capture six crucial country-level capacities: technological capacity, financial capacity, human capital capacity, infrastructural capacity, public policy capacity, and social capacity. Such capacities are part and parcel of the National Absorptive Capacity Systems (NACS). The dataset (MSK dataset) thus produced contains data on 47 variables for 82 LMICs between 2005 and 2019. The dataset has passed a quality and reliability check and can thus be used for comparative analyses of national absorptive capacities and development, transition, and convergence analyses among LMICs.

Submitted 12 September, 2021; originally announced September 2021.

Comments: 65 pages including figures and tables

arXiv:2108.05660 [pdf, other]

doi 10.1201/9781003256083

Development of a Risk-Free COVID-19 Screening Algorithm from Routine Blood Tests Using Ensemble Machine Learning

Authors: Md. Mohsin Sarker Raihan, Md. Mohi Uddin Khan, Laboni Akter, Abdullah Bin Shams

Abstract: The Reverse Transcription Polymerase Chain Reaction (RTPCR) test is the silver bullet diagnostic test to discern COVID infection. Rapid antigen detection is a screening test to identify COVID positive patients in as little as 15 minutes, but has a lower sensitivity than the PCR tests. Besides having multiple standardized test kits, many people are getting infected and either recovering or dying even before the test due to the shortage and cost of kits, lack of indispensable specialists and labs, time-consuming result compared to bulk population especially in developing and underdeveloped countries. Intrigued by the parametric deviations in immunological and hematological profile of a COVID patient, this research work leveraged the concept of COVID-19 detection by proposing a risk-free and highly accurate Stacked Ensemble Machine Learning model to identify a COVID patient from communally available-widespread-cheap routine blood tests which gives a promising accuracy, precision, recall and F1-score of 100%. Analysis from R-curve also shows the preciseness of the risk-free model to be implemented. The proposed method has the potential for large scale ubiquitous low-cost screening application. This can add an extra layer of protection in keeping the number of infected cases to a minimum and control the pandemic by identifying asymptomatic or pre-symptomatic people early.

Submitted 9 May, 2023; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: Please read the (most updated) published version from here: https://doi.org/10.1201/9781003256083 and cite our article (Chapter-11). Video and BibTex citation format can be found in the description: https://youtu.be/Ci8dznDadJ4

Journal ref: Applied Intelligence for Industry 4.0. Chapman and Hall/CRC. 2023

arXiv:2108.01124 [pdf]

Efficacy of Statistical and Artificial Intelligence-based False Information Cyberattack Detection Models for Connected Vehicles

Authors: Sakib Mahmud Khan, Gurcan Comert, Mashrur Chowdhury

Abstract: Connected vehicles (CVs), because of the external connectivity with other CVs and connected infrastructure, are vulnerable to cyberattacks that can instantly compromise the safety of the vehicle itself and other connected vehicles and roadway infrastructure. One such cyberattack is the false information attack, where an external attacker injects inaccurate information into the connected vehicles and eventually can cause catastrophic consequences by compromising safety-critical applications like the forward collision warning. The occurrence and target of such attack events can be very dynamic, making real-time and near-real-time detection challenging. Change point models can be used for real-time anomaly detection caused by the false information attack. In this paper, we have evaluated three change point-based statistical models: Expectation Maximization, Cumulative Summation, and Bayesian Online Change Point Algorithms for cyberattack detection in the CV data. Also, data-driven artificial intelligence (AI) models, which can be used to detect known and unknown underlying patterns in the dataset, have the potential of detecting a real-time anomaly in the CV data. We have used six AI models to detect false information attacks and compared the performance for detecting the attacks with our developed change point models. Our study shows that change point models performed better in real-time false information attack detection compared to the performance of the AI models. Change point models, having the advantage of no training requirements, can be a feasible and computationally efficient alternative to AI models for false information attack detection in connected vehicles.

Submitted 2 August, 2021; originally announced August 2021.

Comments: 18 pages, 6 figures

arXiv:2107.10884 [pdf, other]

Structured second-order methods via natural gradient descent

Authors: Wu Lin, Frank Nielsen, Mohammad Emtiyaz Khan, Mark Schmidt

Abstract: In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces. Natural-gradient descent is an attractive approach to design new algorithms in many settings such as gradient-free, adaptive-gradient, and second-order methods. Our structured methods not only enjoy a structural invariance but also admit a simple expression. Finally, we test the efficiency of our proposed methods on both deterministic non-convex problems and deep learning problems.

Submitted 19 February, 2022; v1 submitted 22 July, 2021; originally announced July 2021.

Comments: Fixed some typos and added a new figure. ICML 2021 workshop paper. A short version of arXiv:2102.07405 with a focus on optimization tasks

arXiv:2107.08265 [pdf, other]

Subset-of-Data Variational Inference for Deep Gaussian-Processes Regression

Authors: Ayush Jain, P. K. Srijith, Mohammad Emtiyaz Khan

Abstract: Deep Gaussian Processes (DGPs) are multi-layer, flexible extensions of Gaussian processes but their training remains challenging. Sparse approximations simplify the training but often require optimization over a large number of inducing inputs and their locations across layers. In this paper, we simplify the training by setting the locations to a fixed subset of data and sampling the inducing inputs from a variational distribution. This reduces the trainable parameters and computation cost without significant performance degradations, as demonstrated by our empirical results on regression problems. Our modifications simplify and stabilize DGP training while making it amenable to sampling schemes for setting the inducing inputs.

Submitted 17 July, 2021; originally announced July 2021.

Comments: Accepted in the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021)

arXiv:2107.04562 [pdf, other]

The Bayesian Learning Rule

Authors: Mohammad Emtiyaz Khan, Håvard Rue

Abstract: We show that many machine-learning algorithms are specific instances of a single algorithm called the \emph{Bayesian learning rule}. The rule, derived from Bayesian principles, yields a wide-range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton's method, and Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.

Submitted 8 June, 2024; v1 submitted 9 July, 2021; originally announced July 2021.

Journal ref: Journal of Machine Learning Research 24, no. 281 (2023): 1-46

arXiv:2106.08769 [pdf, other]

Knowledge-Adaptation Priors

Authors: Mohammad Emtiyaz Khan, Siddharth Swaroop

Abstract: Humans and animals have a natural ability to quickly adapt to their surroundings, but machine-learning models, when subjected to changes, often require a complete retraining from scratch. We present Knowledge-adaptation priors (K-priors) to reduce the cost of retraining by enabling quick and accurate adaptation for a wide-variety of tasks and models. This is made possible by a combination of weight and function-space priors to reconstruct the gradients of the past, which recovers and generalizes many existing, but seemingly-unrelated, adaptation strategies. Training with simple first-order gradient methods can often recover the exact retrained model to an arbitrary accuracy by choosing a sufficiently large memory of the past data. Empirical results show that adaptation with K-priors achieves performance similar to full retraining, but only requires training on a handful of past examples.

Submitted 27 October, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

arXiv:2106.02613 [pdf, other]

Bridging the Gap Between Target Networks and Functional Regularization

Authors: Alexandre Piché, Valentin Thomas, Rafael Pardinas, Joseph Marino, Gian Maria Marconi, Christopher Pal, Mohammad Emtiyaz Khan

Abstract: Bootstrapping is behind much of the successes of deep Reinforcement Learning. However, learning the value function via bootstrapping often leads to unstable training due to fast-changing target values. Target Networks are employed to stabilize training by using an additional set of lagging parameters to estimate the target values. Despite the popularity of Target Networks, their effect on the optimization is still misunderstood. In this work, we show that they act as an implicit regularizer which can be beneficial in some cases, but also have disadvantages such as being inflexible and can result in instabilities, even when vanilla TD(0) converges. To overcome these issues, we propose an explicit Functional Regularization alternative that is flexible and a convex regularizer in function space and we theoretically study its convergence. We conduct an experimental study across a range of environments, discount factors, and off-policiness data collections to investigate the effectiveness of the regularization induced by Target Networks and Functional Regularization in terms of performance, accuracy, and stability. Our findings emphasize that Functional Regularization can be used as a drop-in replacement for Target Networks and result in performance improvement. Furthermore, adjusting both the regularization weight and the network update period in Functional Regularization can result in further performance improvements compared to solely adjusting the network update period as typically done with Target Networks. Our approach also enhances the ability of networks to recover accurate $Q$-values.

Submitted 7 September, 2023; v1 submitted 4 June, 2021; originally announced June 2021.

Comments: The first two authors contributed equally

arXiv:2104.04975 [pdf, other]

Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning

Authors: Alexander Immer, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, Mohammad Emtiyaz Khan

Abstract: Marginal-likelihood based model-selection, even though promising, is rarely used in deep learning due to estimation difficulties. Instead, most approaches rely on validation data, which may not be readily available. In this work, we present a scalable marginal-likelihood estimation method to select both hyperparameters and network architectures, based on the training data alone. Some hyperparameters can be estimated online during training, simplifying the procedure. Our marginal-likelihood estimate is based on Laplace's method and Gauss-Newton approximations to the Hessian, and it outperforms cross-validation and manual-tuning on standard regression and image classification datasets, especially in terms of calibration and out-of-distribution detection. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable (e.g., in nonstationary settings).

Submitted 15 June, 2021; v1 submitted 11 April, 2021; originally announced April 2021.

Comments: ICML 2021

arXiv:2102.07405 [pdf, other]

Tractable structured natural gradient descent using local parameterizations

Authors: Wu Lin, Frank Nielsen, Mohammad Emtiyaz Khan, Mark Schmidt

Abstract: Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations. We address this issue by using \emph{local-parameter coordinates} to obtain a flexible and efficient NGD method that works well for a wide-variety of structured parameterizations. We show four applications where our method (1) generalizes the exponential natural evolutionary strategy, (2) recovers existing Newton-like algorithms, (3) yields new structured second-order algorithms via matrix groups, and (4) gives new algorithms to learn covariances of Gaussian and Wishart-based distributions. We show results on a range of problems from deep learning, variational inference, and evolution strategies. Our work opens a new direction for scalable structured geometric methods.

Submitted 17 January, 2022; v1 submitted 15 February, 2021; originally announced February 2021.

Comments: An extended version of the ICML 2021 paper. Note: A workshop (short) paper with a focus on optimization tasks can be found at arXiv:2107.10884

arXiv:2007.04731 [pdf, other]

Fast Variational Learning in State-Space Gaussian Process Models

Authors: Paul E. Chang, William J. Wilkinson, Mohammad Emtiyaz Khan, Arno Solin

Abstract: Gaussian process (GP) regression with 1D inputs can often be performed in linear time via a stochastic differential equation formulation. However, for non-Gaussian likelihoods, this requires application of approximate inference methods which can make the implementation difficult, e.g., expectation propagation can be numerically unstable and variational inference can be computationally inefficient. In this paper, we propose a new method that removes such difficulties. Building upon an existing method called conjugate-computation variational inference, our approach enables linear-time inference via Kalman recursions while avoiding numerical instabilities and convergence issues. We provide an efficient JAX implementation which exploits just-in-time compilation and allows for fast automatic differentiation through large for-loops. Overall, our approach leads to fast and stable variational inference in state-space GP models that can be scaled to time series with millions of data points.

Submitted 17 July, 2020; v1 submitted 9 July, 2020; originally announced July 2020.

Comments: To appear in MLSP 2020

arXiv:2005.13463 [pdf, other]

Latent Racial Bias -- Evaluating Racism in Police Stop-and-Searches

Authors: Akbir Khan

Abstract: In this paper, we introduce the latent racial bias, a metric and method to evaluate the racial bias within specific events. For the purpose of this paper we explore the British Home Office dataset of stop-and-search incidents. We explore the racial bias in the choice of targets, using a number of statistical models such as graphical probabilistic and TrueSkill Ranking. Firstly, we propose a probabilistic graphical model for modelling racial bias within stop-and-searches and explore varying priors. Secondly, using our inference methods, we produce a set of probability distributions for different racial/ethnic groups based on said model and data. Finally, we produce a set of examples of applications of this model, predicting biases not only for stops but also in the reactive response by law officers.

Submitted 8 May, 2020; originally announced May 2020.

arXiv:2005.12093 [pdf, ps, other]

Mixing properties of Skellam-GARCH processes

Authors: Paul Doukhan, Naushad Mamode Khan, Michael H. Neumann

Abstract: We consider integer-valued GARCH processes, where the count variable conditioned on past values of the count and state variables follows a so-called Skellam distribution. Using arguments for contractive Markov chains we prove that the process has a unique stationary regime. Furthermore, we show asymptotic regularity ($β$-mixing) with geometrically decaying coefficients for the count process. These probabilistic results are complemented by a statistical analysis, a few simulations as well as an application to recent COVID-19 data.

Submitted 13 August, 2020; v1 submitted 25 May, 2020; originally announced May 2020.

MSC Class: 60G10; 60J05

arXiv:2004.14070 [pdf, other]

Continual Deep Learning by Functional Regularisation of Memorable Past

Authors: Pingbo Pan, Siddharth Swaroop, Alexander Immer, Runa Eschenhagen, Richard E. Turner, Mohammad Emtiyaz Khan

Abstract: Continually learning new skills is important for intelligent systems, yet standard deep learning methods suffer from catastrophic forgetting of the past. Recent works address this with weight regularisation. Functional regularisation, although computationally expensive, is expected to perform better, but rarely does so in practice. In this paper, we fix this issue by using a new functional-regularisation approach that utilises a few memorable past examples crucial to avoid forgetting. By using a Gaussian Process formulation of deep networks, our approach enables training in weight-space while identifying both the memorable past and a functional prior. Our method achieves state-of-the-art performance on standard benchmarks and opens a new direction for life-long learning where regularisation and memory-based methods are naturally combined.

Submitted 8 January, 2021; v1 submitted 29 April, 2020; originally announced April 2020.

arXiv:2003.09018 [pdf, other]

Human Activity Recognition from Wearable Sensor Data Using Self-Attention

Authors: Saif Mahmud, M Tanjid Hasan Tonmoy, Kishor Kumar Bhaumik, A K M Mahbubur Rahman, M Ashraful Amin, Mohammad Shoyaib, Muhammad Asif Hossain Khan, Amin Ahsan Ali

Abstract: Human Activity Recognition from body-worn sensor data poses an inherent challenge in capturing spatial and temporal dependencies of time-series signals. In this regard, the existing recurrent or convolutional or their hybrid models for activity recognition struggle to capture spatio-temporal context from the feature space of sensor reading sequence. To address this complex problem, we propose a self-attention based neural network model that foregoes recurrent architectures and utilizes different types of attention mechanisms to generate higher dimensional feature representation used for classification. We performed extensive experiments on four popular publicly available HAR datasets: PAMAP2, Opportunity, Skoda and USC-HAD. Our model achieves significant performance improvement over recent state-of-the-art models in both benchmark test subjects and Leave-one-subject-out evaluation. We also observe that the sensor attention maps produced by our model are able to capture the importance of the modality and placement of the sensors in predicting the different activity classes.

Submitted 17 March, 2020; originally announced March 2020.

Comments: Accepted for publication at the 24th European Conference on Artificial Intelligence (ECAI-2020); 8 pages, 4 figures

arXiv:2002.12592 [pdf]

Wind Speed Prediction using Deep Ensemble Learning with a Jet-like Architecture

Authors: Aqsa Saeed Qureshi, Asifullah Khan, Muhammad Waleed Khan

Abstract: The wind is one of the most increasingly used renewable energy resources. Accurate and reliable forecast of wind speed is necessary for efficient power production; however, it is not an easy task because it depends upon meteorological features of the surrounding region. Deep learning is extensively used these days for performing feature extraction. It has also been observed that the integration of several learning models, known as ensemble learning, generally gives better performance compared to a single model. The design of wings, tail, and nose of a jet improves the aerodynamics resulting in a smooth and controlled flight of the jet against the variations of the air currents. Inspired by the shape and working of a jet, a novel Deep Ensemble Learning using Jet-like Architecture (DEL-Jet) technique is proposed to enhance the diversity and robustness of a learning system against the variations in the input space. The diverse feature spaces of the base-regressors are exploited using the jet-like ensemble architecture. Two Convolutional Neural Networks (as jet wings) and one deep Auto-Encoder (as jet tail) are used to extract the diverse feature spaces from the input data. After that, nonlinear PCA (as jet main body) is employed to reduce the dimensionality of extracted feature space. Finally, both the reduced and the original feature spaces are exploited to train the meta-regressor (as jet nose) for forecasting the wind speed. The performance of the proposed DEL-Jet technique is evaluated for ten independent runs and shows that the deep and jet-like architecture helps in improving the robustness and generalization of the learning system.

Submitted 20 March, 2020; v1 submitted 28 February, 2020; originally announced February 2020.

Comments: Pages: 14, Tables: 6, Figures: 3

arXiv:2002.11985 [pdf, other]

doi 10.1162/tacl_a_00413

Compressing Large-Scale Transformer-Based Models: A Case Study on BERT

Authors: Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Hassan Sajjad, Preslav Nakov, Deming Chen, Marianne Winslett

Abstract: Pre-trained Transformer-based models have achieved state-of-the-art performance for various Natural Language Processing (NLP) tasks. However, these models often have billions of parameters, and, thus, are too resource-hungry and computation-intensive to suit low-capability devices or applications with strict latency requirements. One potential remedy for this is model compression, which has attracted a lot of research attention. Here, we summarize the research in compressing Transformers, focusing on the especially popular BERT model. In particular, we survey the state of the art in compression for BERT, we clarify the current best practices for compressing large-scale Transformer models, and we provide insights into the workings of various methods. Our categorization and analysis also shed light on promising future research directions for achieving lightweight, accurate, and generic NLP models.

Submitted 1 June, 2021; v1 submitted 27 February, 2020; originally announced February 2020.

Comments: To appear in TACL 2021. The arXiv version is a pre-MIT Press publication version

Showing 1–50 of 94 results for author: Khan, M