-
SentenceMIM: A Latent Variable Language Model
Authors:
Micha Livne,
Kevin Swersky,
David J. Fleet
Abstract:
SentenceMIM is a probabilistic auto-encoder for language data, trained with Mutual Information Machine (MIM) learning to provide a fixed-length representation of variable-length language observations (i.e., similar to VAE). Previous attempts to learn VAEs for language data faced challenges due to posterior collapse. MIM learning encourages high mutual information between observations and latent variables, and is robust against posterior collapse. As such, it learns informative representations whose dimension can be an order of magnitude higher than that of existing language VAEs. Importantly, the SentenceMIM loss has no hyper-parameters, simplifying optimization. We compare SentenceMIM with VAE and AE on multiple datasets. SentenceMIM yields excellent reconstruction, comparable to AEs, with a rich structured latent space, comparable to VAEs. The structured latent representation is demonstrated with interpolation between sentences of different lengths. We demonstrate the versatility of SentenceMIM by utilizing a trained model for question-answering and transfer learning, without fine-tuning, outperforming VAE and AE with similar architectures.
Submitted 21 April, 2021; v1 submitted 18 February, 2020;
originally announced March 2020.
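The abstract above notes that a fixed-length latent code makes it possible to interpolate between sentences of different lengths. A minimal sketch of that idea follows, assuming plain linear interpolation between two latent vectors; the paper may use a different scheme, and the function name `interpolate_latents` and the example vectors are hypothetical stand-ins for the outputs of a trained encoder.

```python
import numpy as np

def interpolate_latents(z_start, z_end, num_steps):
    """Return num_steps latent codes on the straight line from z_start to z_end.

    Assumption: both codes have the same fixed dimension, which is what lets
    sentences of different lengths be compared in the first place.
    """
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    if z_start.shape != z_end.shape:
        raise ValueError("latent codes must share the same shape")
    # Mixing weights from 0 (all z_start) to 1 (all z_end), inclusive.
    alphas = np.linspace(0.0, 1.0, num_steps)
    return np.stack([(1.0 - a) * z_start + a * z_end for a in alphas])

# Hypothetical 4-dimensional codes standing in for two encoded sentences.
z_short = np.array([0.0, 1.0, -1.0, 2.0])
z_long = np.array([2.0, -1.0, 1.0, 0.0])
path = interpolate_latents(z_short, z_long, num_steps=5)
```

Each row of `path` would then be passed to a decoder to produce an intermediate sentence; the decoder itself is outside the scope of this sketch.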
-
High Mutual Information in Representation Learning with Symmetric Variational Inference
Authors:
Micha Livne,
Kevin Swersky,
David J. Fleet
Abstract:
We introduce the Mutual Information Machine (MIM), a novel formulation of representation learning, using a joint distribution over the observations and latent state in an encoder/decoder framework. Our key principles are symmetry and mutual information: symmetry encourages the encoder and decoder to learn different factorizations of the same underlying distribution, and mutual information encourages the learning of useful representations for downstream tasks. Our starting point is the symmetric Jensen-Shannon divergence between the encoding and decoding joint distributions, plus a regularizer that encourages mutual information. We show that this can be bounded by a tractable cross-entropy loss function between the true model and a parameterized approximation, and relate this to the maximum likelihood framework. We also relate MIM to variational autoencoders (VAEs) and demonstrate that MIM is capable of learning symmetric factorizations, with high mutual information that avoids posterior collapse.
Submitted 3 October, 2019;
originally announced October 2019.
-
MIM: Mutual Information Machine
Authors:
Micha Livne,
Kevin Swersky,
David J. Fleet
Abstract:
We introduce the Mutual Information Machine (MIM), a probabilistic auto-encoder for learning joint distributions over observations and latent variables. MIM reflects three design principles: 1) low divergence, to encourage the encoder and decoder to learn consistent factorizations of the same underlying distribution; 2) high mutual information, to encourage an informative relation between data and latent variables; and 3) low marginal entropy, or compression, which tends to encourage clustered latent representations. We show that a combination of the Jensen-Shannon divergence and the joint entropy of the encoding and decoding distributions satisfies these criteria, and admits a tractable cross-entropy bound that can be optimized directly with Monte Carlo and stochastic gradient descent. We contrast MIM learning with maximum likelihood and VAEs. Experiments show that MIM learns representations with high mutual information, consistent encoding and decoding distributions, effective latent clustering, and data log likelihood comparable to VAE, while avoiding posterior collapse.
Submitted 21 February, 2020; v1 submitted 7 October, 2019;
originally announced October 2019.
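The abstract above builds on the Jensen-Shannon divergence between the encoding and decoding joint distributions. As a rough illustration of that quantity only (not of the MIM loss or its cross-entropy bound), the sketch below computes the divergence for small discrete distributions. The two joint tables are hypothetical values chosen for illustration and do not come from the paper.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions.

    Terms with p == 0 contribute 0. Assumes q > 0 wherever p > 0, which
    always holds when q is the mixture used in js_divergence below.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence: the average KL of p and q to their mixture."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mixture = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)

# Hypothetical 2x2 joint tables over (observation, latent), flattened.
encoding_joint = np.array([0.4, 0.1, 0.1, 0.4])
decoding_joint = np.array([0.3, 0.2, 0.2, 0.3])
print(js_divergence(encoding_joint, decoding_joint))
```

The divergence is symmetric in its arguments, is zero when the two distributions coincide, and is bounded above by log 2 in nats, which is reached when the distributions have disjoint support.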
-
TzK: Flow-Based Conditional Generative Model
Authors:
Micha Livne,
David Fleet
Abstract:
We formulate a new class of conditional generative models based on probability flows. Trained with maximum likelihood, it provides efficient inference and sampling from class-conditionals or the joint distribution, and does not require a priori knowledge of the number of classes or the relationships between classes. This allows one to train generative models from multiple, heterogeneous datasets, while retaining strong prior models over subsets of the data (e.g., from a single dataset, class label, or attribute). In this paper, in addition to end-to-end learning, we show how one can learn a single model from multiple datasets with a relatively weak Glow architecture, and then extend it by conditioning on different knowledge types (e.g., a single dataset). This yields log likelihood comparable to the state of the art, and compelling samples from conditional priors.
Submitted 22 April, 2019; v1 submitted 5 February, 2019;
originally announced February 2019.
-
TzK Flow - Conditional Generative Model
Authors:
Micha Livne,
David J. Fleet
Abstract:
We introduce TzK (pronounced "task"), a conditional probability flow-based model that exploits attributes (e.g., style, class membership, or other side information) in order to learn a tight conditional prior around manifolds of the target observations. The model is trained via approximated ML, and offers efficient approximation of arbitrary data sample distributions (similar to GAN and flow-based ML), and stable training (similar to VAE and ML), while avoiding variational approximations. TzK exploits meta-data to facilitate a bottleneck, similar to autoencoders, thereby producing a low-dimensional representation. Unlike autoencoders, the bottleneck does not limit model expressiveness, similar to flow-based ML. Supervised, unsupervised, and semi-supervised learning are supported by replacing missing observations with samples from learned priors. We demonstrate TzK by training jointly on MNIST and Omniglot datasets with minimal preprocessing and weak supervision, with results comparable to the state of the art.
Submitted 19 February, 2019; v1 submitted 5 November, 2018;
originally announced November 2018.
-
Deep learning to achieve clinically applicable segmentation of head and neck anatomy for radiotherapy
Authors:
Stanislav Nikolov,
Sam Blackwell,
Alexei Zverovitch,
Ruheena Mendes,
Michelle Livne,
Jeffrey De Fauw,
Yojan Patel,
Clemens Meyer,
Harry Askham,
Bernardino Romera-Paredes,
Christopher Kelly,
Alan Karthikesalingam,
Carlton Chu,
Dawn Carnell,
Cheng Boon,
Derek D'Souza,
Syed Ali Moinuddin,
Bethany Garie,
Yasmin McQuinlan,
Sarah Ireland,
Kiarna Hampton,
Krystle Fuller,
Hugh Montgomery,
Geraint Rees,
Mustafa Suleyman
, et al. (4 additional authors not shown)
Abstract:
Over half a million individuals are diagnosed with head and neck cancer each year worldwide. Radiotherapy is an important curative treatment for this disease, but it requires manual, time-consuming delineation of radio-sensitive organs at risk (OARs). This planning process can delay treatment, while also introducing inter-operator variability with resulting downstream radiation dose differences. While auto-segmentation algorithms offer a potentially time-saving solution, the challenges in defining, quantifying and achieving expert performance remain. Adopting a deep learning approach, we demonstrate a 3D U-Net architecture that achieves expert-level performance in delineating 21 distinct head and neck OARs commonly segmented in clinical practice. The model was trained on a dataset of 663 deidentified computed tomography (CT) scans acquired in routine clinical practice, with both segmentations taken from clinical practice and segmentations created by experienced radiographers as part of this research, all in accordance with consensus OAR definitions. We demonstrate the model's clinical applicability by assessing its performance on a test set of 21 CT scans from clinical practice, each with the 21 OARs segmented by two independent experts. We also introduce surface Dice similarity coefficient (surface DSC), a new metric for the comparison of organ delineation, to quantify deviation between OAR surface contours rather than volumes, better reflecting the clinical task of correcting errors in the automated organ segmentations. The model's generalisability is then demonstrated on two distinct open source datasets, reflecting different centres and countries from those used in model training. With appropriate validation studies and regulatory approvals, this system could improve the efficiency, consistency, and safety of radiotherapy pathways.
Submitted 13 January, 2021; v1 submitted 12 September, 2018;
originally announced September 2018.
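The abstract above describes the surface DSC as a measure of deviation between surface contours rather than volumes. The sketch below is a simplified point-cloud approximation of that idea, not the authors' implementation: it assumes each surface is given as roughly uniformly sampled 3D points and counts the fraction of points on either surface that lie within a tolerance of the other surface. The function name `surface_dice` and the sample coordinates are hypothetical.

```python
import numpy as np

def surface_dice(points_a, points_b, tolerance):
    """Approximate surface Dice for two surfaces given as (n, 3) point arrays.

    Assumption: points are sampled about uniformly on each surface, so point
    counts stand in for surface area. tolerance uses the same units as the
    coordinates (for example millimetres).
    """
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # A point counts as agreeing if its nearest neighbour on the other
    # surface is within the tolerance.
    a_within = dists.min(axis=1) <= tolerance
    b_within = dists.min(axis=0) <= tolerance
    return float((a_within.sum() + b_within.sum()) / (len(a) + len(b)))

# Hypothetical contours: the first point matches, the second deviates.
contour_a = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
contour_b = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
score = surface_dice(contour_a, contour_b, tolerance=0.5)
```

The dense pairwise distance matrix keeps the sketch short but scales quadratically with the number of points; a spatial index or a distance transform on the voxel grid would be a more practical choice for full-resolution scans.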