-
ReZero is All You Need: Fast Convergence at Large Depth
Authors:
Thomas Bachlechner,
Bodhisattwa Prasad Majumder,
Huanru Henry Mao,
Garrison W. Cottrell,
Julian McAuley
Abstract:
Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al. used free probability theory to show that dynamical isometry…
▽ More
Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al. used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning. We show that the simplest architecture change of gating each residual connection using a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches. Although much simpler than its predecessors, this gate enables training thousands of fully connected layers with fast convergence and better test performance for ResNets trained on CIFAR-10. We apply this technique to language modeling and find that we can easily train 120-layer Transformers. When applied to 12 layer Transformers, it converges 56% faster on enwiki8.
△ Less
Submitted 24 June, 2020; v1 submitted 10 March, 2020;
originally announced March 2020.
-
Improving Neural Story Generation by Targeted Common Sense Grounding
Authors:
Huanru Henry Mao,
Bodhisattwa Prasad Majumder,
Julian McAuley,
Garrison W. Cottrell
Abstract:
Stories generated with neural language models have shown promise in grammatical and stylistic consistency. However, the generated stories are still lacking in common sense reasoning, e.g., they often contain sentences deprived of world knowledge. We propose a simple multi-task learning scheme to achieve quantitatively better common sense reasoning in language models by leveraging auxiliary trainin…
▽ More
Stories generated with neural language models have shown promise in grammatical and stylistic consistency. However, the generated stories are still lacking in common sense reasoning, e.g., they often contain sentences deprived of world knowledge. We propose a simple multi-task learning scheme to achieve quantitatively better common sense reasoning in language models by leveraging auxiliary training signals from datasets designed to provide common sense grounding. When combined with our two-stage fine-tuning pipeline, our method achieves improved common sense reasoning and state-of-the-art perplexity on the Writing Prompts (Fan et al., 2018) story generation dataset.
△ Less
Submitted 27 February, 2020; v1 submitted 25 August, 2019;
originally announced August 2019.
-
Kernelized Weighted SUSAN based Fuzzy C-Means Clustering for Noisy Image Segmentation
Authors:
Satrajit Mukherjee,
Bodhisattwa Prasad Majumder,
Aritran Piplai,
Swagatam Das
Abstract:
The paper proposes a novel Kernelized image segmentation scheme for noisy images that utilizes the concept of Smallest Univalue Segment Assimilating Nucleus (SUSAN) and incorporates spatial constraints by computing circular colour map induced weights. Fuzzy damping coefficients are obtained for each nucleus or center pixel on the basis of the corresponding weighted SUSAN area values, the weights b…
▽ More
The paper proposes a novel Kernelized image segmentation scheme for noisy images that utilizes the concept of Smallest Univalue Segment Assimilating Nucleus (SUSAN) and incorporates spatial constraints by computing circular colour map induced weights. Fuzzy damping coefficients are obtained for each nucleus or center pixel on the basis of the corresponding weighted SUSAN area values, the weights being equal to the inverse of the number of horizontal and vertical moves required to reach a neighborhood pixel from the center pixel. These weights are used to vary the contributions of the different nuclei in the Kernel based framework. The paper also presents an edge quality metric obtained by fuzzy decision based edge candidate selection and final computation of the blurriness of the edges after their selection. The inability of existing algorithms to preserve edge information and structural details in their segmented maps necessitates the computation of the edge quality factor (EQF) for all the competing algorithms. Qualitative and quantitative analysis have been rendered with respect to state-of-the-art algorithms and for images ridden with varying types of noises. Speckle noise ridden SAR images and Rician noise ridden Magnetic Resonance Images have also been considered for evaluating the effectiveness of the proposed algorithm in extracting important segmentation information.
△ Less
Submitted 28 March, 2016;
originally announced March 2016.