-
Accounting for Variance in Machine Learning Benchmarks
Authors:
Xavier Bouthillier,
Pierre Delaunay,
Mirko Bronzi,
Assya Trofimov,
Brennan Nichyporuk,
Justin Szeto,
Naz Sepah,
Edward Raff,
Kanika Madan,
Vikram Voleti,
Samira Ebrahimi Kahou,
Vincent Michalski,
Dmitriy Serdyuk,
Tal Arbel,
Chris Pal,
Gaël Varoquaux,
Pascal Vincent
Abstract:
Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in the light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator brings it closer to the ideal estimator at a 51-fold reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
Submitted 1 March, 2021;
originally announced March 2021.
-
Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis
Authors:
Thomas George,
César Laurent,
Xavier Bouthillier,
Nicolas Ballas,
Pascal Vincent
Abstract:
Optimization algorithms that leverage gradient covariance information, such as variants of natural gradient descent (Amari, 1998), offer the prospect of yielding more effective descent directions. For models with many parameters, the covariance matrix they are based on becomes gigantic, making them inapplicable in their original form. This has motivated research into both simple diagonal approximations and more sophisticated factored approximations such as KFAC (Heskes, 2000; Martens & Grosse, 2015; Grosse & Martens, 2016). In the present work we draw inspiration from both to propose a novel approximation that is provably better than KFAC and amenable to cheap partial updates. It consists in tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis, in which the diagonal approximation is likely to be more effective. Experiments show improvements over KFAC in optimization speed for several deep network architectures.
Submitted 26 July, 2021; v1 submitted 11 June, 2018;
originally announced June 2018.
-
Dropout as data augmentation
Authors:
Xavier Bouthillier,
Kishore Konda,
Pascal Vincent,
Roland Memisevic
Abstract:
Dropout is typically interpreted as bagging a large number of models sharing parameters. We show that using dropout in a network can also be interpreted as a kind of data augmentation in the input space without domain knowledge. We present an approach to projecting the dropout noise within a network back into the input space, thereby generating augmented versions of the training data, and we show that training a deterministic network on the augmented samples yields similar results. Finally, we propose a new dropout noise scheme based on our observations and show that it improves dropout results without adding significant computational cost.
Submitted 7 January, 2016; v1 submitted 29 June, 2015;
originally announced June 2015.