-
Rule-based Evolving Fuzzy System for Time Series Forecasting: New Perspectives Based on Type-2 Fuzzy Sets Measures Approach
Authors:
Eduardo Santos de Oliveira Marques,
Arthur Caio Vargas Pinto,
Kaike Sa Teles Rocha Alves,
Eduardo Pestana de Aguiar
Abstract:
Real-world data contain uncertainty and variations that can be correlated to external variables, known as randomness. An alternative cause of randomness is chaos, which can be an important component of chaotic time series. One existing method for dealing with this type of data is the use of evolving Fuzzy Systems (eFSs), which have proven to be a powerful class of models for time series forecasting because of their autonomy in handling data and highly complex problems in real-world applications. However, due to their working structure, type-2 fuzzy sets can outperform type-1 fuzzy sets in highly uncertain scenarios. We therefore propose ePL-KRLS-FSM+, an enhanced evolving fuzzy modeling approach that combines participatory learning (PL), a kernel recursive least squares method (KRLS), type-2 fuzzy logic and data transformation into fuzzy sets (FSs). This improvement makes it possible to create and measure type-2 fuzzy sets for better handling of uncertainties in the data, generating a model that can predict chaotic data with increased accuracy. The model is evaluated using two complex datasets: the chaotic time series generated by the Mackey-Glass delay differential equation with different degrees of chaos, and the main Taiwanese stock index, the Taiwan Capitalization Weighted Stock Index (TAIEX). Model performance is compared to related state-of-the-art rule-based eFS models and classical approaches and is analyzed in terms of error metrics, runtime and the number of final rules. Forecasting results show that the proposed model is competitive and performs consistently compared with type-1 models, also outperforming other forecasting methods by showing the lowest error metrics and number of final rules.
Submitted 5 February, 2025;
originally announced February 2025.
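For readers unfamiliar with the Mackey-Glass benchmark named in the abstract above, the following is a minimal sketch of how such a series is commonly generated. It is not taken from the paper: the function name `mackey_glass`, the forward-Euler scheme, the step size, the constant initial history, and the parameter values (beta=0.2, gamma=0.1, exponent 10, delay 17) are assumptions based on the textbook form of the equation, dx/dt = beta * x(t - tau) / (1 + x(t - tau)^n) - gamma * x(t). The delay tau is the parameter usually varied to change how chaotic the series is.

```python
def mackey_glass(n_steps, tau=17, beta=0.2, gamma=0.1, power=10, dt=1.0, x0=1.2):
    """Generate a Mackey-Glass series with a forward-Euler step (illustrative only).

    Assumes dt = 1.0 so that the delay tau equals a whole number of steps.
    """
    # Assumed constant history: x(t) = x0 for all t <= 0.
    history = [x0] * (tau + 1)
    series = []
    for _ in range(n_steps):
        x_now = history[-1]
        x_delayed = history[-(tau + 1)]  # value tau steps before x_now
        dx = beta * x_delayed / (1.0 + x_delayed ** power) - gamma * x_now
        history.append(x_now + dt * dx)
        series.append(history[-1])
    return series


if __name__ == "__main__":
    values = mackey_glass(500)
    print(len(values), min(values), max(values))
```

A forecasting experiment would typically turn such a series into lagged input/target pairs; which lags and horizon the paper uses is not stated in the abstract.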
-
On Generalization Bounds for Neural Networks with Low Rank Layers
Authors:
Andrea Pinto,
Akshay Rangamani,
Tomaso Poggio
Abstract:
While previous optimization results have suggested that deep neural networks tend to favour low-rank weight matrices, the implications of this inductive bias on generalization bounds remain underexplored. In this paper, we apply Maurer's chain rule for Gaussian complexity to analyze how low-rank layers in deep networks can prevent the accumulation of rank and dimensionality factors that typically multiply across layers. This approach yields generalization bounds for rank and spectral norm constrained networks. We compare our results to prior generalization bounds for deep networks, highlighting how deep networks with low-rank layers can achieve better generalization than those with full-rank layers. Additionally, we discuss how this framework provides new perspectives on the generalization capabilities of deep networks exhibiting neural collapse.
Submitted 20 November, 2024;
originally announced November 2024.
-
How Neural Networks Learn the Support is an Implicit Regularization Effect of SGD
Authors:
Pierfrancesco Beneventano,
Andrea Pinto,
Tomaso Poggio
Abstract:
We investigate the ability of deep neural networks to identify the support of the target function. Our findings reveal that mini-batch SGD effectively learns the support in the first layer of the network by shrinking to zero the weights associated with irrelevant components of the input. In contrast, we demonstrate that while vanilla GD also approximates the target function, it requires an explicit regularization term to learn the support in the first layer. We prove that this property of mini-batch SGD is due to a second-order implicit regularization effect which is proportional to $\eta / b$ (step size / batch size). Our results are not only further evidence that implicit regularization has a significant impact on training optimization dynamics, but they also shed light on the structure of the features that are learned by the network. Additionally, they suggest that smaller batches enhance feature interpretability and reduce dependency on initialization.
Submitted 16 June, 2024;
originally announced June 2024.
-
Scaling Vision with Sparse Mixture of Experts
Authors:
Carlos Riquelme,
Joan Puigcerver,
Basil Mustafa,
Maxim Neumann,
Rodolphe Jenatton,
André Susano Pinto,
Daniel Keysers,
Neil Houlsby
Abstract:
Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade off performance and compute smoothly at test time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.
Submitted 10 June, 2021;
originally announced June 2021.
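To make the dense-versus-sparse distinction in the abstract above concrete, here is a minimal sketch of generic top-k gating for a mixture of experts. It is not the V-MoE implementation: the names `softmax`, `top_k_route`, and `sparse_moe`, the scalar toy experts, and the choice not to renormalise the selected gate weights are all assumptions made for illustration. The point it shows is that only the k selected experts are evaluated for a given input.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return {expert_index: gate_weight} for the k highest-scoring experts."""
    gates = softmax(router_logits)
    ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    return {i: gates[i] for i in ranked[:k]}


def sparse_moe(x, experts, router_logits, k=1):
    """Combine the outputs of only the selected experts, weighted by their gates."""
    routes = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in routes.items())


if __name__ == "__main__":
    # Hypothetical toy experts acting on a scalar input.
    toy_experts = [lambda v: v + 1.0, lambda v: 2.0 * v, lambda v: -v]
    print(sparse_moe(3.0, toy_experts, [0.0, 2.0, 1.0], k=2))
```

In a real model the router logits would come from a learned layer applied to each token or patch; here they are passed in directly to keep the sketch self-contained.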
-
Deep Ensembles for Low-Data Transfer Learning
Authors:
Basil Mustafa,
Carlos Riquelme,
Joan Puigcerver,
André Susano Pinto,
Daniel Keysers,
Neil Houlsby
Abstract:
In the low-data regime, it is difficult to train good supervised models from scratch. Instead, practitioners turn to pre-trained models, leveraging transfer learning. Ensembling is an empirically and theoretically appealing way to construct powerful predictive models, but the predominant approach of training multiple deep networks with different random initialisations collides with the need for transfer via pre-trained weights. In this work, we study different ways of creating ensembles from pre-trained models. We show that the nature of pre-training itself is a performant source of diversity, and propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset. The approach is simple: use nearest-neighbour accuracy to rank pre-trained models, fine-tune the best ones with a small hyperparameter sweep, and greedily construct an ensemble to minimise validation cross-entropy. When evaluated together with strong baselines on 19 different downstream tasks (the Visual Task Adaptation Benchmark), this achieves state-of-the-art performance at a much lower inference budget, even when selecting from over 2,000 pre-trained models. We also assess our ensembles on ImageNet variants and show improved robustness to distribution shift.
Submitted 19 October, 2020; v1 submitted 14 October, 2020;
originally announced October 2020.
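The last step described in the abstract above, greedily constructing an ensemble to minimise validation cross-entropy, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the toy probabilities, selection without replacement, averaging of member probabilities, and stopping as soon as no candidate lowers the loss are all choices made here for clarity. The earlier steps (nearest-neighbour ranking and fine-tuning) are assumed to have already produced per-model validation predictions.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    eps = 1e-12
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def average_probs(members):
    """Average class probabilities over ensemble members, example by example."""
    n = len(members)
    return [[sum(col) / n for col in zip(*rows)] for rows in zip(*members)]


def greedy_ensemble(model_probs, labels, max_size=3):
    """Add the model that most lowers validation cross-entropy until none helps."""
    chosen, best_loss = [], float("inf")
    while len(chosen) < max_size:
        candidates = [name for name in model_probs if name not in chosen]
        if not candidates:
            break
        scored = [
            (cross_entropy(average_probs([model_probs[m] for m in chosen + [name]]), labels), name)
            for name in candidates
        ]
        loss, name = min(scored)
        if loss >= best_loss:  # no candidate improves the ensemble
            break
        chosen.append(name)
        best_loss = loss
    return chosen, best_loss


if __name__ == "__main__":
    # Hypothetical validation predictions for three fine-tuned models, two examples.
    toy = {
        "a": [[0.95, 0.05], [0.60, 0.40]],
        "b": [[0.40, 0.60], [0.05, 0.95]],
        "c": [[0.10, 0.90], [0.90, 0.10]],
    }
    print(greedy_ensemble(toy, labels=[0, 1]))
```

In the toy data, models "a" and "b" are each confidently right on a different example, so their average beats either alone, while "c" is rejected because adding it raises the loss.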
-
Scalable Transfer Learning with Expert Models
Authors:
Joan Puigcerver,
Carlos Riquelme,
Basil Mustafa,
Cedric Renggli,
André Susano Pinto,
Sylvain Gelly,
Daniel Keysers,
Neil Houlsby
Abstract:
Transfer of pre-trained representations can improve sample efficiency and reduce computational requirements for new tasks. However, representations used for transfer are usually generic, and are not tailored to a particular distribution of downstream tasks. We explore the use of expert representations for transfer with a simple, yet effective, strategy. We train a diverse set of experts by exploiting existing label structures, and use cheap-to-compute performance proxies to select the relevant expert for each target task. This strategy scales the process of transferring to new tasks, since it does not revisit the pre-training data during transfer. Accordingly, it requires little extra compute per target task, and results in a speed-up of 2-3 orders of magnitude compared to competing approaches. Further, we provide an adapter-based architecture able to compress many experts into a single model. We evaluate our approach on two different data sources and demonstrate that it outperforms baselines on over 20 diverse vision tasks in both cases.
Submitted 28 September, 2020;
originally announced September 2020.
-
A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
Authors:
Xiaohua Zhai,
Joan Puigcerver,
Alexander Kolesnikov,
Pierre Ruyssen,
Carlos Riquelme,
Mario Lucic,
Josip Djolonga,
Andre Susano Pinto,
Maxim Neumann,
Alexey Dosovitskiy,
Lucas Beyer,
Olivier Bachem,
Michael Tschannen,
Marcin Michalski,
Olivier Bousquet,
Sylvain Gelly,
Neil Houlsby
Abstract:
Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual representations hinders progress. Popular protocols are often too constrained (linear classification), limited in diversity (ImageNet, CIFAR, Pascal-VOC), or only weakly related to representation quality (ELBO, reconstruction error). We present the Visual Task Adaptation Benchmark (VTAB), which defines good representations as those that adapt to diverse, unseen tasks with few examples. With VTAB, we conduct a large-scale study of many popular publicly-available representation learning algorithms. We carefully control confounders such as architecture and tuning budget. We address questions like: How effective are ImageNet representations beyond standard natural datasets? How do representations trained via generative and discriminative models compare? To what extent can self-supervision replace labels? And, how close are we to general visual representations?
Submitted 21 February, 2020; v1 submitted 1 October, 2019;
originally announced October 2019.