-
High Quality Diffusion Distillation on a Single GPU with Relative and Absolute Position Matching
Authors:
Guoqiang Zhang,
Kenta Niwa,
J. P. Lewis,
Cedric Mesnage,
W. Bastiaan Kleijn
Abstract:
We introduce relative and absolute position matching (RAPM), a diffusion distillation method resulting in high quality generation that can be trained efficiently on a single GPU. Recent diffusion distillation research has achieved excellent results for high-resolution text-to-image generation with methods such as phased consistency models (PCM) and improved distribution matching distillation (DMD2). However, these methods generally require many GPUs (e.g., 8-64) and large batch sizes (e.g., 128-2048) during training, resulting in memory and compute requirements that are beyond the resources of some researchers. RAPM provides effective single-GPU diffusion distillation training with a batch size of 1. The new method attempts to mimic the sampling trajectories of the teacher model by matching the relative and absolute positions. The design of relative positions is inspired by PCM. Two discriminators are introduced accordingly in RAPM, one for matching relative positions and the other for absolute positions. Experimental results on StableDiffusion (SD) V1.5 and SDXL indicate that RAPM with 4 timesteps produces FID scores comparable to those of the best method with 1 timestep under very limited computational resources.
Submitted 26 March, 2025;
originally announced March 2025.
-
Zero-Shot Mono-to-Binaural Speech Synthesis
Authors:
Alon Levkovitch,
Julian Salazar,
Soroosh Mariooryad,
RJ Skerry-Ryan,
Nadav Bar,
Bastiaan Kleijn,
Eliya Nachmani
Abstract:
We present ZeroBAS, a neural method to synthesize binaural audio from monaural audio recordings and positional information without training on any binaural data. To our knowledge, this is the first published zero-shot neural approach to mono-to-binaural audio synthesis. Specifically, we show that a parameter-free geometric time warping and amplitude scaling based on source location suffices to get an initial binaural synthesis that can be refined by iteratively applying a pretrained denoising vocoder. Furthermore, we find this leads to generalization across room conditions, which we measure by introducing a new dataset, TUT Mono-to-Binaural, to evaluate state-of-the-art monaural-to-binaural synthesis methods on unseen conditions. Our zero-shot method is perceptually on par with supervised methods on the standard mono-to-binaural dataset, and even surpasses them on our out-of-distribution TUT Mono-to-Binaural dataset. Our results highlight the potential of pretrained generative audio models and zero-shot learning to unlock robust binaural audio synthesis.
Submitted 28 May, 2025; v1 submitted 11 December, 2024;
originally announced December 2024.
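The geometric initialization mentioned in the ZeroBAS abstract above (time warping plus amplitude scaling from source location) can be illustrated with a minimal sketch. It assumes a static source, a speed of sound of 343 m/s, inverse-distance gain, and linear interpolation; the function names and all of these choices are illustrative assumptions and not the authors' implementation.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def delay_and_gain(source, ear, sample_rate):
    """Propagation delay in samples and an assumed 1/distance gain."""
    d = math.dist(source, ear)  # source must not coincide with the ear
    return d / SPEED_OF_SOUND * sample_rate, 1.0 / d


def warp(mono, delay, gain):
    """out[n] = gain * mono(n - delay), linearly interpolated, zero outside."""
    out = []
    for n in range(len(mono)):
        t = n - delay
        k = math.floor(t)
        frac = t - k
        a = mono[k] if 0 <= k < len(mono) else 0.0
        b = mono[k + 1] if 0 <= k + 1 < len(mono) else 0.0
        out.append(gain * ((1.0 - frac) * a + frac * b))
    return out


def geometric_binaural(mono, source, left_ear, right_ear, sample_rate):
    """Return (left, right) channels obtained by per-ear delay and gain."""
    return tuple(
        warp(mono, *delay_and_gain(source, ear, sample_rate))
        for ear in (left_ear, right_ear)
    )
```

With a source placed to the listener's left, the left channel arrives earlier and with higher gain than the right. In the abstract this initial estimate is then refined by a pretrained denoising vocoder, which the sketch does not attempt to model.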
-
Existence and phase structure of random inverse limit measures
Authors:
B. J. K. Kleijn
Abstract:
Analogous to Kolmogorov's theorem for the existence of stochastic processes describing random functions, we consider theorems for the existence of stochastic processes describing random measures, as limits of inverse measure systems. Specifically, given a coherent inverse system of random (bounded/signed/positive/probability) histograms on refining partitions, we study conditions for the existence and uniqueness of a corresponding random inverse limit, a Radon probability measure on the space of (bounded/signed/positive/probability) measures. Depending on the topology (vague/tight/weak/total-variational) and Kingman's notion of complete randomness, the limiting random measure is in one of four phases, distinguished by their degrees of concentration (support/domination/discreteness). Results are applied in the well-known Dirichlet and Polya tree families of random probability measures and in a new Gaussian family of signed inverse limit measures. In these three families, examples of all four phases occur and we describe the corresponding conditions on defining parameters.
Submitted 15 May, 2025; v1 submitted 19 August, 2024;
originally announced August 2024.
-
On Exact Bit-level Reversible Transformers Without Changing Architectures
Authors:
Guoqiang Zhang,
J. P. Lewis,
W. B. Kleijn
Abstract:
Various reversible deep neural network (DNN) models have been proposed to reduce memory consumption in the training process. However, almost all existing reversible DNNs either require special non-standard architectures or are constructed by modifying existing DNN architectures considerably to enable reversibility. In this work we present the BDIA-transformer, which is an exact bit-level reversible transformer that uses an unchanged standard architecture for inference. The basic idea is to first treat each transformer block as the Euler integration approximation for solving an ordinary differential equation (ODE) and then incorporate the technique of bidirectional integration approximation (BDIA) into the neural architecture, together with activation quantization to make it exactly bit-level reversible. In the training process, we let a hyper-parameter $\gamma$ in BDIA-transformer randomly take one of the two values $\{0.5, -0.5\}$ per training sample per transformer block for averaging every two consecutive integration approximations. As a result, BDIA-transformer can be viewed as training an ensemble of ODE solvers parameterized by a set of binary random variables, which regularizes the model and results in improved validation accuracy. Lightweight side information per transformer block must be stored in the forward process to account for binary quantization loss and so enable exact bit-level reversibility. In the inference procedure, the expectation $\mathbb{E}(\gamma)=0$ is taken to make the resulting architectures of BDIA-transformer identical to transformers up to activation quantization. Our experiments in both image classification and language translation show that BDIA-transformers outperform their conventional counterparts significantly in terms of validation performance while also requiring considerably less training memory.
Submitted 5 October, 2024; v1 submitted 12 July, 2024;
originally announced July 2024.
-
Contiguity and remote contiguity of some random graphs
Authors:
B. J. K. Kleijn,
S. Rizzelli
Abstract:
Asymptotic properties of random graph sequences, like occurrence of a giant component or full connectivity in Erdős-Rényi graphs, are usually derived with very specific choices for defining parameters. The question arises to what extent those parameter choices may be perturbed without losing the asymptotic property. Writing $(P_n)$ and $(Q_n)$ for two sequences of graph distributions, asymptotic equivalence (convergence in total-variation) and contiguity ($P_n(A_n)=o(1) \implies Q_n(A_n)=o(1)$) have been considered by (Janson, 2010) and others; here we use so-called remote contiguity (for some fixed $a_n\downarrow 0$, $P_n(A_n)=o(a_n) \implies Q_n(A_n)=o(1)$) to show that connectivity properties are preserved in more heavily perturbed Erdős-Rényi graphs. The techniques we demonstrate here with random graphs extend to general asymptotic properties, e.g. in more complex large-graph limits, scaling limits, large-sample limits, etc.
Submitted 17 February, 2024;
originally announced February 2024.
-
TrailBlazer: Trajectory Control for Diffusion-Based Video Generation
Authors:
Wan-Duo Kurt Ma,
J. P. Lewis,
W. Bastiaan Kleijn
Abstract:
Within recent approaches to text-to-video (T2V) generation, achieving controllability in the synthesized video is often a challenge. Typically, this issue is addressed by providing low-level per-frame guidance in the form of edge maps, depth maps, or an existing video to be altered. However, the process of obtaining such guidance can be labor-intensive. This paper focuses on enhancing controllability in video synthesis by employing straightforward bounding boxes to guide the subject in various ways, all without the need for neural network training, finetuning, optimization at inference time, or the use of pre-existing videos. Our algorithm, TrailBlazer, is built upon a pre-trained T2V model and is easy to implement. The subject is directed by a bounding box through the proposed spatial and temporal attention map editing. Moreover, we introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts, without the need to provide a detailed mask. The method is efficient, with negligible additional computation relative to the underlying pre-trained model. Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.
Submitted 8 April, 2024; v1 submitted 31 December, 2023;
originally announced January 2024.
-
Exact Diffusion Inversion via Bi-directional Integration Approximation
Authors:
Guoqiang Zhang,
J. P. Lewis,
W. Bastiaan Kleijn
Abstract:
Recently, various methods have been proposed to address the inconsistency issue of DDIM inversion to enable image editing, such as EDICT [36] and Null-text inversion [22]. However, the above methods introduce considerable computational overhead. In this paper, we propose a new technique, named bi-directional integration approximation (BDIA), to perform exact diffusion inversion with negligible computational overhead. Suppose we would like to estimate the next diffusion state $\boldsymbol{z}_{i-1}$ at timestep $t_i$ with the historical information $(i,\boldsymbol{z}_i)$ and $(i+1,\boldsymbol{z}_{i+1})$. We first obtain the estimated Gaussian noise $\hat{\boldsymbol{\epsilon}}(\boldsymbol{z}_i,i)$, and then apply the DDIM update procedure twice for approximating the ODE integration over the next time-slot $[t_i, t_{i-1}]$ in the forward manner and the previous time-slot $[t_i, t_{i+1}]$ in the backward manner. The DDIM step for the previous time-slot is used to refine the integration approximation made earlier when computing $\boldsymbol{z}_i$. A nice property of BDIA-DDIM is that the update expression for $\boldsymbol{z}_{i-1}$ is a linear combination of $(\boldsymbol{z}_{i+1}, \boldsymbol{z}_i, \hat{\boldsymbol{\epsilon}}(\boldsymbol{z}_i,i))$. This allows for exact backward computation of $\boldsymbol{z}_{i+1}$ given $(\boldsymbol{z}_i, \boldsymbol{z}_{i-1})$, thus leading to exact diffusion inversion. It is demonstrated with experiments that (round-trip) BDIA-DDIM is particularly effective for image editing. Our experiments further show that BDIA-DDIM produces markedly better image sampling qualities than DDIM for text-to-image generation.
BDIA can also be applied to improve the performance of other ODE solvers in addition to DDIM. In our work, it is found that applying BDIA to the EDM sampling procedure produces consistently better performance over four pre-trained models.
Submitted 26 November, 2023; v1 submitted 10 July, 2023;
originally announced July 2023.
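The exact-inversion argument in the BDIA abstract above depends only on the update for z_{i-1} being linear in z_{i+1}, with the noise estimate evaluated at z_i alone. A minimal sketch of that principle follows. The coefficients `a`, `b`, `c` and the stand-in predictor `eps_hat` are hypothetical placeholders, not the coefficients or model used in the paper.

```python
import math


def eps_hat(z, i):
    # Hypothetical stand-in for the pretrained noise predictor.
    return [math.tanh(v) * 0.1 * i for v in z]


def bdia_like_step(z_ip1, z_i, i, a, b, c):
    """Forward update: z_{i-1} = a*z_{i+1} + b*z_i + c*eps_hat(z_i, i)."""
    e = eps_hat(z_i, i)
    return [a * p + b * q + c * r for p, q, r in zip(z_ip1, z_i, e)]


def bdia_like_inverse(z_i, z_im1, i, a, b, c):
    """Solve the same linear relation for z_{i+1}, given (z_i, z_{i-1})."""
    e = eps_hat(z_i, i)
    return [(m - b * q - c * r) / a for m, q, r in zip(z_im1, z_i, e)]


# Round trip with arbitrary illustrative values (a must be non-zero).
z_ip1 = [0.3, -1.2, 0.8]
z_i = [0.25, -1.0, 0.7]
a, b, c = 0.5, 0.6, -0.2
z_im1 = bdia_like_step(z_ip1, z_i, i=5, a=a, b=b, c=c)
recovered = bdia_like_inverse(z_i, z_im1, i=5, a=a, b=b, c=c)
```

Because the noise predictor is only ever called on z_i, which is known in both directions, no iterative solve is needed and `recovered` matches `z_ip1` up to floating-point rounding.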
-
On Accelerating Diffusion-Based Sampling Process via Improved Integration Approximation
Authors:
Guoqiang Zhang,
Niwa Kenta,
W. Bastiaan Kleijn
Abstract:
A popular approach to sampling from a diffusion-based generative model is to solve an ordinary differential equation (ODE). In existing samplers, the coefficients of the ODE solvers are pre-determined by the ODE formulation, the reverse discrete timesteps, and the employed ODE methods. In this paper, we consider accelerating several popular ODE-based sampling processes (including EDM, DDIM, and DPM-Solver) by optimizing certain coefficients via improved integration approximation (IIA). We propose to minimize, for each time step, a mean squared error (MSE) function with respect to the selected coefficients. The MSE is constructed by applying the original ODE solver for a set of fine-grained timesteps, which in principle provides a more accurate integration approximation in predicting the next diffusion state. The proposed IIA technique does not require any change of a pre-trained model, and only introduces a very small computational overhead for solving a number of quadratic optimization problems. Extensive experiments show that considerably better FID scores can be achieved by using IIA-EDM, IIA-DDIM, and IIA-DPM-Solver than the original counterparts when the number of neural function evaluations (NFE) is small (i.e., less than 25).
Submitted 3 October, 2023; v1 submitted 22 April, 2023;
originally announced April 2023.
-
Lookahead Diffusion Probabilistic Models for Refining Mean Estimation
Authors:
Guoqiang Zhang,
Niwa Kenta,
W. Bastiaan Kleijn
Abstract:
We propose lookahead diffusion probabilistic models (LA-DPMs) to exploit the correlation in the outputs of the deep neural networks (DNNs) over subsequent timesteps in diffusion probabilistic models (DPMs) to refine the mean estimation of the conditional Gaussian distributions in the backward process. A typical DPM first obtains an estimate of the original data sample $\boldsymbol{x}$ by feeding the most recent state $\boldsymbol{z}_i$ and index $i$ into the DNN model and then computes the mean vector of the conditional Gaussian distribution for $\boldsymbol{z}_{i-1}$. We propose to calculate a more accurate estimate for $\boldsymbol{x}$ by performing extrapolation on the two estimates of $\boldsymbol{x}$ that are obtained by feeding $(\boldsymbol{z}_{i+1},i+1)$ and $(\boldsymbol{z}_{i},i)$ into the DNN model. The extrapolation can be easily integrated into the backward process of existing DPMs by introducing an additional connection over two consecutive timesteps, and fine-tuning is not required. Extensive experiments showed that plugging the additional connection into DDPM, DDIM, DEIS, S-PNDM, and high-order DPM-Solvers leads to a significant performance gain in terms of FID score.
Submitted 21 April, 2023;
originally announced April 2023.
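The lookahead idea in the LA-DPM abstract above, extrapolating from two consecutive estimates of x, can be sketched in a few lines. The linear form (1 + lam) * x_hat_i - lam * x_hat_ip1 and the weight `lam` are assumptions made for illustration; the abstract states only that an extrapolation of the two estimates is used.

```python
def lookahead_estimate(x_hat_i, x_hat_ip1, lam):
    """Extrapolate from the older estimate (index i+1) past the newer one (index i).

    lam = 0 returns the newer estimate unchanged; lam > 0 moves further
    along the direction x_hat_i - x_hat_ip1. The weight is hypothetical.
    """
    return [(1.0 + lam) * new - lam * old
            for new, old in zip(x_hat_i, x_hat_ip1)]
```

For example, with `x_hat_i = [1.0, 2.0]`, `x_hat_ip1 = [0.0, 0.0]` and `lam = 0.5`, the refined estimate is `[1.5, 3.0]`. The refined estimate would then replace the plain estimate when computing the mean for z_{i-1}, which is the "additional connection over two consecutive timesteps" the abstract refers to.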
-
LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models
Authors:
Teerapat Jenrungrot,
Michael Chinen,
W. Bastiaan Kleijn,
Jan Skoglund,
Zalán Borsos,
Neil Zeghidour,
Marco Tagliasacchi
Abstract:
We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec.
Submitted 22 March, 2023;
originally announced March 2023.
-
Directed Diffusion: Direct Control of Object Placement through Attention Guidance
Authors:
Wan-Duo Kurt Ma,
J. P. Lewis,
Avisek Lahiri,
Thomas Leung,
W. Bastiaan Kleijn
Abstract:
Text-guided diffusion models such as DALLE-2, Imagen, eDiff-I, and Stable Diffusion are able to generate an effectively endless variety of images given only a short text prompt describing the desired image content. In many cases the images are of very high quality. However, these models often struggle to compose scenes containing several key objects such as characters in specified positional relationships. The missing capability to "direct" the placement of characters and objects both within and across images is crucial in storytelling, as recognized in the literature on film and animation theory. In this work, we take a particularly straightforward approach to providing the needed direction. Drawing on the observation that the cross-attention maps for prompt words reflect the spatial layout of objects denoted by those words, we introduce an optimization objective that produces "activation" at desired positions in these cross-attention maps. The resulting approach is a step toward generalizing the applicability of text-guided diffusion models beyond single images to collections of related images, as in storybooks. Directed Diffusion provides easy high-level positional control over multiple objects, while making use of an existing pre-trained model and maintaining a coherent blend between the positioned objects and the background. Moreover, it requires only a few lines to implement.
Submitted 26 September, 2023; v1 submitted 25 February, 2023;
originally announced February 2023.
-
Estimation of Source and Receiver Positions, Room Geometry and Reflection Coefficients From a Single Room Impulse Response
Authors:
Wangyang Yu,
W. Bastiaan Kleijn
Abstract:
We propose an algorithm to simultaneously estimate source and receiver positions, room geometry and reflection coefficients from a single room impulse response. It is based on a symmetry analysis of the room impulse response. The proposed method utilizes the times of arrival of the direct path, first-order reflections and second-order reflections. The proposed method is robust to erroneous pulses and non-specular reflections. It can be applied to any room with parallel walls as long as the required arrival times of reflections are available. In contrast to the state-of-the-art method, we do not restrict the location of source and receiver.
Submitted 22 January, 2023;
originally announced January 2023.
-
Ultra-Low-Bitrate Speech Coding with Pretrained Transformers
Authors:
Ali Siahkoohi,
Michael Chinen,
Tom Denton,
W. Bastiaan Kleijn,
Jan Skoglund
Abstract:
Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. As such, we use a pretrained Transformer in tandem with a convolutional encoder, which is trained end-to-end with a quantizer and a generative adversarial net decoder. Our numerical experiments show that supplementing the convolutional encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of $600\,\mathrm{bps}$ that outperforms the original neural speech codec in synthesized speech quality when trained at the same bitrate. Subjective human evaluations suggest that the quality of the resulting codec is comparable to or better than that of conventional codecs operating at three to four times the rate.
Submitted 5 July, 2022;
originally announced July 2022.
-
On the Relevance of Bandwidth Extension for Speaker Verification
Authors:
Marcos Faundez-Zanuy,
Mattias Nilsson,
W. Bastiaan Kleijn
Abstract:
In this paper, we consider the effect of a bandwidth extension of narrow-band speech signals (0.3-3.4 kHz) to 0.3-8 kHz on speaker verification. Using covariance matrix based verification systems together with detection error trade-off curves, we compare the performance between systems operating on narrow-band, wide-band (0-8 kHz), and bandwidth-extended speech. The experiments were conducted using different short-time spectral parameterizations derived from microphone and ISDN speech databases. The studied bandwidth-extension algorithm did not introduce artifacts that affected the speaker verification task, and we achieved improvements between 1 and 10 percent (depending on the model order) over the verification system designed for narrow-band speech when mel-frequency cepstral coefficients for the short-time spectral parameterization were used.
Submitted 5 April, 2022;
originally announced April 2022.
-
A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range
Authors:
Guoqiang Zhang,
Kenta Niwa,
W. Bastiaan Kleijn
Abstract:
We make contributions towards improving adaptive-optimizer performance. Our improvements are based on suppression of the range of adaptive stepsizes in the AdaBelief optimizer. Firstly, we show that the particular placement of the parameter epsilon within the update expressions of AdaBelief reduces the range of the adaptive stepsizes, making AdaBelief closer to SGD with momentum. Secondly, we extend AdaBelief by further suppressing the range of the adaptive stepsizes. To achieve the above goal, we perform mutual layerwise vector projections between the gradient g_t and its first momentum m_t before using them to estimate the second momentum. The new optimization method is referred to as Aida. Thirdly, extensive experimental results show that Aida outperforms nine optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIFAR10 and CIFAR100, while matching the best performance of the nine methods when training WGAN-GP models for image generation tasks. Furthermore, Aida produces higher validation accuracies than AdaBelief for training ResNet18 over ImageNet. Code is available at https://github.com/guoqiang-x-zhang/AidaOptimizer
Submitted 24 January, 2023; v1 submitted 24 March, 2022;
originally announced March 2022.
-
On the relevance of bandwidth extension for speaker identification
Authors:
Marcos Faundez-Zanuy,
Mattias Nilsson,
W. Bastiaan Kleijn
Abstract:
In this paper we discuss the relevance of bandwidth extension for speaker identification tasks. Mainly, we want to study whether it is possible to recognize voices that have been bandwidth extended. For this purpose, we created two different databases (microphonic and ISDN) of speech signals that were bandwidth extended from telephone bandwidth ([300, 3400] Hz) to full bandwidth ([100, 8000] Hz). We have evaluated different parameterizations, and we have found that the MELCEPST parameterization can take advantage of the bandwidth extension algorithms in several situations.
Submitted 24 February, 2022;
originally announced February 2022.
-
Extending AdamW by Leveraging Its Second Moment and Magnitude
Authors:
Guoqiang Zhang,
Niwa Kenta,
W. Bastiaan Kleijn
Abstract:
Recent work [4] analyses the local convergence of Adam in a neighbourhood of an optimal solution for a twice-differentiable function. It is found that the learning rate has to be sufficiently small to ensure local stability of the optimal solution. The above convergence results also hold for AdamW. In this work, we propose a new adaptive optimisation method, which we refer to as Aida, by extending AdamW in two aspects with the purpose of relaxing the requirement of a small learning rate for local stability. Firstly, we consider tracking the 2nd moment r_t of the pth power of the gradient-magnitudes. r_t reduces to v_t of AdamW when p=2. Suppose {m_t} is the first moment of AdamW. It is known that the update direction m_{t+1}/(v_{t+1}+epsilon)^0.5 (or m_{t+1}/(v_{t+1}^0.5+epsilon)) of AdamW (or Adam) can be decomposed as the sign vector sign(m_{t+1}) multiplied elementwise by a vector of magnitudes |m_{t+1}|/(v_{t+1}+epsilon)^0.5 (or |m_{t+1}|/(v_{t+1}^0.5+epsilon)). Aida is designed to compute the qth power of the magnitude in the form of |m_{t+1}|^q/(r_{t+1}+epsilon)^(q/p) (or |m_{t+1}|^q/((r_{t+1})^(q/p)+epsilon)), which reduces to that of AdamW when (p,q)=(2,1).
Suppose the origin 0 is a local optimal solution of a twice-differentiable function. It is found theoretically that when q>1 and p>1 in Aida, the origin 0 is locally stable only when the weight-decay is non-zero. Experiments are conducted for solving ten toy optimisation problems and training Transformer and Swin-Transformer for two deep learning (DL) tasks. The empirical study demonstrates that in a number of scenarios (including the two DL tasks), Aida with particular setups of (p,q) not equal to (2,1) outperforms the setup (p,q)=(2,1) of AdamW.
Submitted 9 December, 2021;
originally announced December 2021.
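The update magnitude quoted in the Aida abstract above can be checked numerically. The sketch below implements only the elementwise direction sign(m) * |m|^q / (r + epsilon)^(q/p) as written in the abstract. The function names and the default `eps` are illustrative assumptions, and moment tracking, bias correction, and weight decay are deliberately left out.

```python
import math


def aida_direction(m, r, p, q, eps=1e-8):
    """Elementwise sign(m) * |m|^q / (r + eps)^(q/p)."""
    return [math.copysign(abs(mi) ** q / (ri + eps) ** (q / p), mi)
            for mi, ri in zip(m, r)]


def adamw_direction(m, v, eps=1e-8):
    """Elementwise m / (v + eps)^0.5, the AdamW form quoted in the abstract."""
    return [mi / math.sqrt(vi + eps) for mi, vi in zip(m, v)]


# Arbitrary illustrative moment values.
m = [0.2, -0.05, 1.5]
v = [0.04, 0.01, 2.0]
same = aida_direction(m, v, p=2, q=1)
ref = adamw_direction(m, v)
```

With (p, q) = (2, 1) the two directions agree up to floating-point rounding, matching the reduction to AdamW stated in the abstract. Other (p, q) pairs change how strongly the magnitude is rescaled while preserving the sign of each component.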
-
Confidence sets in a sparse stochastic block model with two communities of unknown sizes
Authors:
B. J. K. Kleijn,
J. van Waaij
Abstract:
In a sparse stochastic block model with two communities of unequal sizes we derive two posterior concentration inequalities that imply (1) posterior (almost-)exact recovery of the community structure under sparsity bounds comparable to well-known sharp bounds in the planted bi-section model; (2) a construction of confidence sets for the community assignment from credible sets, with finite graph sizes. The latter enables exact frequentist uncertainty quantification with Bayesian credible sets at non-asymptotic graph sizes, where posteriors can be simulated well. There turns out to be no proportionality between credible and confidence levels: for given edge probabilities and a desired confidence level, there exists a critical graph size where the required credible level drops sharply from close to one to close to zero. At such graph sizes the frequentist decides to include not most of the posterior support for the construction of his confidence set, but only a small subset of community assignments containing the highest amounts of posterior probability (like the maximum-a-posteriori estimator). It is argued that for the proposed construction of confidence sets, a form of early stopping applies to MCMC sampling of the posterior, which would enable the computation of confidence sets at larger graph sizes.
Submitted 16 August, 2021;
originally announced August 2021.
-
Revisiting the Primal-Dual Method of Multipliers for Optimisation over Centralised Networks
Authors:
Guoqiang Zhang,
Kenta Niwa,
W. Bastiaan Kleijn
Abstract:
The primal-dual method of multipliers (PDMM) was originally designed for solving a decomposable optimisation problem over a general network. In this paper, we revisit PDMM for optimisation over a centralized network. We first note that the recently proposed method FedSplit [1] implements PDMM for a centralized network. In [1], Inexact FedSplit (i.e., gradient based FedSplit) was also studied both empirically and theoretically. We identify the cause for the poor reported performance of Inexact FedSplit, which is due to the improper initialisation in the gradient operations at the client side. To fix the issue of Inexact FedSplit, we propose two versions of Inexact PDMM, which are referred to as gradient-based PDMM (GPDMM) and accelerated GPDMM (AGPDMM), respectively. AGPDMM accelerates GPDMM at the cost of transmitting two times the number of parameters from the server to each client per iteration compared to GPDMM. We provide a new convergence bound for GPDMM for a class of convex optimisation problems. Our new bounds are tighter than those derived for Inexact FedSplit. We also investigate the update expressions of AGPDMM and SCAFFOLD to find their similarities. It is found that when the number K of gradient steps at the client side per iteration is K=1, both AGPDMM and SCAFFOLD reduce to vanilla gradient descent with proper parameter setup. Experimental results indicate that AGPDMM converges faster than SCAFFOLD when K>1 while GPDMM converges slightly worse than SCAFFOLD.
Submitted 19 July, 2021;
originally announced July 2021.
-
Uncertainty quantification and testing in a stochastic block model with two unequal communities
Authors:
J. van Waaij,
B. J. K. Kleijn
Abstract:
We show posterior convergence for the community structure in the planted bi-section model, for several interesting priors. Examples include the case where the label on each vertex is i.i.d. Bernoulli distributed, with some parameter $r\in(0,1)$. The parameter $r$ may be fixed, or equipped with a beta distribution. We do not have constraints on the class sizes, which might be as small as zero, or include all vertices, and everything in between. This enables us to test between a uniform (Erdős-Rényi) random graph with no distinguishable community and the planted bi-section model. The exact bounds for posterior convergence enable us to convert credible sets into confidence sets. Symmetric testing with posterior odds is shown to be consistent.
Submitted 13 August, 2021; v1 submitted 18 May, 2021;
originally announced May 2021.
-
Handling Background Noise in Neural Speech Generation
Authors:
Tom Denton,
Alejandro Luebs,
Felicia S. C. Lim,
Andrew Storus,
Hengchin Yeh,
W. Bastiaan Kleijn,
Jan Skoglund
Abstract:
Recent advances in neural-network based generative modeling of speech have shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise, preventing their use in practical applications. In this paper we examine the reason and discuss methods to overcome this issue. Placing a denoising preprocessing stage when extracting features and targeting clean speech during training is shown to be the best performing strategy.
Submitted 23 February, 2021;
originally announced February 2021.
-
Generative Speech Coding with Predictive Variance Regularization
Authors:
W. Bastiaan Kleijn,
Andrew Storus,
Michael Chinen,
Tom Denton,
Felicia S. C. Lim,
Alejandro Luebs,
Jan Skoglund,
Hengchin Yeh
Abstract:
The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the ineffectiveness of modeling a sum of independent signals with a single autoregressive model. We introduce predictive-variance regularization to reduce the sensitivity to outliers, resulting in a significant increase in performance. We show that noise reduction to remove unwanted signals can significantly increase performance. We provide extensive subjective performance evaluations that show that our system based on generative modeling provides state-of-the-art coding performance at 3 kb/s for real-world speech signals at reasonable computational complexity.
Submitted 18 February, 2021;
originally announced February 2021.
-
Heavy tailed distributions in closing auctions
Authors:
M. Derksen,
B. Kleijn,
R. de Vilder
Abstract:
We study the tails of closing auction return distributions for a sample of liquid European stocks. We use the stochastic call auction model of Derksen et al. (2020a) to derive a relation between tail exponents of limit order placement distributions and tail exponents of the resulting closing auction return distribution, and we verify this relation empirically. Counter-intuitively, large closing price fluctuations are typically not caused by large market orders; instead, tails become heavier when market orders are removed. The model explains this by the observation that limit orders are submitted so as to counter existing market order imbalance.
Submitted 18 December, 2020;
originally announced December 2020.
-
Variance Constrained Autoencoding
Authors:
D. T. Braithwaite,
M. O'Connor,
W. B. Kleijn
Abstract:
Recent state-of-the-art autoencoder based generative models have an encoder-decoder structure and learn a latent representation with a pre-defined distribution that can be sampled from. Implementing the encoder networks of these models in a stochastic manner provides a natural and common approach to avoid overfitting and enforce a smooth decoder function. However, we show that for stochastic encoders, simultaneously attempting to enforce a distribution constraint and minimising an output distortion leads to a reduction in generative and reconstruction quality. In addition, attempting to enforce a latent distribution constraint is not reasonable when performing disentanglement. Hence, we propose the variance-constrained autoencoder (VCAE), which only enforces a variance constraint on the latent distribution. Our experiments show that VCAE improves upon Wasserstein Autoencoder and the Variational Autoencoder in both reconstruction and generative quality on MNIST and CelebA. Moreover, we show that VCAE equipped with a total correlation penalty term performs equivalently to FactorVAE at learning disentangled representations on 3D-Shapes while being a more principled approach.
Submitted 7 May, 2020;
originally announced May 2020.
-
Uncertainty quantification in the stochastic block model with an unknown number of classes
Authors:
J. van Waaij,
B. J. K. Kleijn
Abstract:
We study the frequentist properties of Bayesian statistical inference for the stochastic block model, with an unknown number of classes of varying sizes. We equip the space of vertex labellings with a prior on the number of classes and, conditionally, a prior on the labels. The number of classes may grow to infinity as a function of the number of vertices, depending on the sparsity of the graph. We derive non-asymptotic posterior contraction rates of the form $P_{\theta_{0,n}}\Pi_n(B_n\mid X^n)\le \epsilon_n$, where $X^n$ is the observed graph, generated according to $P_{\theta_{0,n}}$, $B_n$ is either $\{\theta_{0,n}\}$ or, in the very sparse case, a ball around $\theta_{0,n}$ of known extent, and $\epsilon_n$ is an explicit rate of convergence.
These results enable conversion of credible sets to confidence sets. In the sparse case, credible sets are shown to be confidence sets. In the very sparse case, credible sets are enlarged to form confidence sets. Confidence levels are explicit, for each $n$, as a function of the credible level and the rate of convergence.
Hypothesis testing between the number of classes is considered with the help of posterior odds, and is shown to be consistent. Explicit upper bounds on errors of the first and second type and an explicit lower bound on the power of the tests are given.
Submitted 4 May, 2020;
originally announced May 2020.
-
Effects of MiFID II on stock price formation
Authors:
Mike Derksen,
Bas Kleijn,
Robin de Vilder
Abstract:
This paper examines the effects of MiFID II on European stock markets. We study the effects of the new tick size regime, both intraday and in the closing auction. An increase (decrease) in tick size is associated with a decrease (increase) in intraday liquidity, but a more (less) stable market. In the closing auction an increase in tick size has a positive effect on liquidity. Moreover, we report a positive relationship between tick size and transacted volume, in particular in the closing auction. Finally, closing auction volumes have increased substantially since MiFID II and price formation in closing auctions has become more efficient.
Submitted 25 August, 2020; v1 submitted 23 March, 2020;
originally announced March 2020.
-
Distributed Network Privacy using Error Correcting Codes
Authors:
Matt O'Connor,
W. Bastiaan Kleijn
Abstract:
Most current distributed processing research deals with improving the flexibility and convergence speed of algorithms for networks of finite size with no constraints on information sharing and no concept of expected levels of signal privacy. In this work we investigate the concept of data privacy in unbounded public networks, where linear codes are used to create hard limits on the number of nodes contributing to a distributed task. We accomplish this by wrapping local observations in a linear code and intentionally applying symbol errors prior to transmission. If many nodes join the distributed task, a proportional number of symbol errors is introduced into the code, leading to decoding failure if the code's predefined symbol error limit is exceeded.
Submitted 17 December, 2019;
originally announced December 2019.
-
Approximated Orthonormal Normalisation in Training Neural Networks
Authors:
Guoqiang Zhang,
Kenta Niwa,
W. B. Kleijn
Abstract:
Generalisation of a deep neural network (DNN) is one major concern when employing the deep learning approach for solving practical problems. In this paper we propose a new technique, named approximated orthonormal normalisation (AON), to improve the generalisation capacity of a DNN model. Considering a weight matrix W from a particular neural layer in the model, our objective is to design a function h(W) such that its row vectors are approximately orthogonal to each other while allowing the DNN model to fit the training data sufficiently accurately. Doing so avoids co-adaptation among neurons of the same layer and thereby improves network-generalisation capacity. Specifically, at each iteration, we first approximate (WW^T)^(-1/2) using its Taylor expansion before multiplying it with the matrix W. The matrix product is then normalised by applying the spectral normalisation (SN) technique to obtain h(W). Conceptually speaking, AON is designed to turn orthonormal regularisation into orthonormal normalisation to avoid manually balancing the original and penalty functions. Experimental results show that AON yields promising validation performance compared to orthonormal regularisation.
Submitted 14 January, 2020; v1 submitted 21 November, 2019;
originally announced November 2019.
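The AON step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the order of the Taylor expansion, so the sketch assumes a first-order expansion around the identity, (WW^T)^(-1/2) ≈ (3I - WW^T)/2, which is only reasonable when WW^T is close to I. The function names `aon`, `spectral_norm` and `row_cosine` are hypothetical, and the spectral norm is estimated with plain power iteration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def spectral_norm(a, iters=200):
    """Estimate the largest singular value of `a` by power iteration on a^T a."""
    at_a = matmul(transpose(a), a)
    v = [1.0] * len(at_a)
    for _ in range(iters):
        w = [sum(r * x for r, x in zip(row, v)) for row in at_a]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    w = [sum(r * x for r, x in zip(row, v)) for row in at_a]
    # For a unit eigenvector v, ||a^T a v|| equals sigma_max squared.
    return math.sqrt(math.sqrt(sum(x * x for x in w)))

def aon(w):
    """Sketch of h(W): approximate (W W^T)^(-1/2) W, then spectrally normalise."""
    n = len(w)
    gram = matmul(w, transpose(w))
    # Assumed first-order Taylor step: (I + E)^(-1/2) ~ I - E/2 with E = W W^T - I.
    inv_sqrt = [[(3.0 * (i == j) - gram[i][j]) / 2.0 for j in range(n)] for i in range(n)]
    m = matmul(inv_sqrt, w)
    sigma = spectral_norm(m)
    return [[x / sigma for x in row] for row in m]

def row_cosine(a, i, j):
    """Cosine of the angle between rows i and j of `a`."""
    dot = sum(x * y for x, y in zip(a[i], a[j]))
    ni = math.sqrt(sum(x * x for x in a[i]))
    nj = math.sqrt(sum(x * x for x in a[j]))
    return dot / (ni * nj)

W = [[1.0, 0.2, 0.0], [0.1, 0.9, 0.1]]
H = aon(W)
```

For this small example the rows of `H` are noticeably closer to orthogonal than the rows of `W`, and the largest singular value of `H` is 1 up to the accuracy of the power iteration.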
-
Generative Speech Enhancement Based on Cloned Networks
Authors:
Michael Chinen,
W. Bastiaan Kleijn,
Felicia S. C. Lim,
Jan Skoglund
Abstract:
We propose to implement speech enhancement by the regeneration of clean speech from a salient representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the extractor network. The clones receive mel-frequency spectra of different noisy versions of the same speech signal as input. By encouraging the outputs of the clones to be similar for these different input signals, we train a feature extractor network that is robust to noise. At inference, the salient features form the input to a WaveNet network that generates a natural and clean speech signal with the same attributes as the ground-truth clean signal. As the signal becomes noisier, our system produces natural-sounding errors that stay on the speech manifold, in place of traditional artifacts found in other systems. Our experiments confirm that our generative enhancement system provides state-of-the-art enhancement performance within the generative class of enhancers according to a MUSHRA-like test. The clone-based system matches or outperforms the other systems at each input signal-to-noise ratio (SNR) range with statistical significance.
Submitted 10 September, 2019;
originally announced September 2019.
-
Salient Speech Representations Based on Cloned Networks
Authors:
W. Bastiaan Kleijn,
Felicia S. C. Lim,
Michael Chinen,
Jan Skoglund
Abstract:
We define salient features as features that are shared by signals that are defined as being equivalent by a system designer. The definition allows the designer to contribute qualitative information. We aim to find salient features that are useful as conditioning for generative networks. We extract salient features by jointly training a set of clones of an encoder network. Each network clone receives as input a different signal from a set of equivalent signals. The objective function encourages the network clones to map their input into a set of features that is identical across the clones. It additionally encourages feature independence and, optionally, reconstruction of a desired target signal by a decoder. As an application, we train a system that extracts a time-sequence of feature vectors of speech and uses it as a conditioning of a WaveNet generative system, facilitating both coding and enhancement.
Submitted 19 August, 2019;
originally announced August 2019.
-
The HSIC Bottleneck: Deep Learning without Back-Propagation
Authors:
Wan-Duo Kurt Ma,
J. P. Lewis,
W. Bastiaan Kleijn
Abstract:
We introduce the HSIC (Hilbert-Schmidt independence criterion) bottleneck for training deep neural networks. The HSIC bottleneck is an alternative to the conventional cross-entropy loss and backpropagation that has a number of distinct advantages. It mitigates exploding and vanishing gradients, resulting in the ability to learn very deep networks without skip connections. There is no requirement for symmetric feedback or update locking. We find that the HSIC bottleneck provides performance on MNIST/FashionMNIST/CIFAR10 classification comparable to backpropagation with a cross-entropy target, even when the system is not encouraged to make the output resemble the classification labels. Appending a single layer trained with SGD (without backpropagation) to reformat the information further improves performance.
Submitted 5 December, 2019; v1 submitted 5 August, 2019;
originally announced August 2019.
-
Clearing price distributions in call auctions
Authors:
M. Derksen,
B. Kleijn,
R. de Vilder
Abstract:
We propose a model for price formation in financial markets based on clearing of a standard call auction with random orders, and verify its validity for prediction of the daily closing price distribution statistically. The model considers random buy and sell orders, placed following demand- and supply-side valuation distributions; an equilibrium equation then leads to a distribution for clearing price and transacted volume. Bid and ask volumes are left as free parameters, permitting possibly heavy-tailed or very skewed order flow conditions. In highly liquid auctions, the clearing price distribution converges to an asymptotically normal central limit, with mean and variance in terms of supply/demand-valuation distributions and order flow imbalance. By means of simulations, we illustrate the influence of variations in order flow and valuation distributions on price/volume, noting a distinction between high- and low-volume auction price variance. To verify the validity of the model statistically, we predict a year's worth of daily closing price distributions for 5 constituents of the Eurostoxx 50 index; Kolmogorov-Smirnov statistics and QQ-plots demonstrate with ample statistical significance that the model predicts closing price distributions accurately, and compares favourably with alternative methods of prediction.
Submitted 28 November, 2019; v1 submitted 16 April, 2019;
originally announced April 2019.
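The clearing mechanism that the abstract above builds on can be shown with a toy example. The sketch below is not the paper's stochastic model: it assumes unit-volume limit orders, searches only over submitted limit prices, and breaks ties by taking the lowest volume-maximising price. The function name `clear_call_auction` and these conventions are hypothetical.

```python
def clear_call_auction(bids, asks):
    """Return (price, volume) that maximises matched volume for unit-size limit orders.

    bids: limit prices of buy orders (willing to buy at or below the limit)
    asks: limit prices of sell orders (willing to sell at or above the limit)
    """
    best_price, best_volume = None, 0
    for p in sorted(set(bids) | set(asks)):
        demand = sum(1 for b in bids if b >= p)  # buyers who accept price p
        supply = sum(1 for a in asks if a <= p)  # sellers who accept price p
        volume = min(demand, supply)
        # Strict inequality keeps the lowest price among volume-maximising candidates.
        if volume > best_volume:
            best_price, best_volume = p, volume
    return best_price, best_volume

price, volume = clear_call_auction([10, 11, 12], [9, 10.5, 13])
```

In the paper's setting the order prices would be random draws from valuation distributions, so repeating such a clearing step over many simulated order books yields an empirical clearing price distribution.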
-
Room Geometry Estimation from Room Impulse Responses using Convolutional Neural Networks
Authors:
Wangyang Yu,
W. Bastiaan Kleijn
Abstract:
We describe a new method to estimate the geometry of a room given room impulse responses. The method utilises convolutional neural networks to estimate the room geometry and uses the mean square error as the loss function. In contrast to existing methods, we do not require the position or distance of sources or receivers in the room. The method can be used with only a single room impulse response between one source and one receiver for room geometry estimation. The proposed estimation method can achieve an average accuracy of six centimetres. In addition, the proposed method is shown to be computationally efficient compared to state-of-the-art methods.
Submitted 15 May, 2019; v1 submitted 1 April, 2019;
originally announced April 2019.
-
Rapidly Adapting Moment Estimation
Authors:
Guoqiang Zhang,
Kenta Niwa,
W. Bastiaan Kleijn
Abstract:
Adaptive gradient methods such as Adam have been shown to be very effective for training deep neural networks (DNNs) by tracking the second moment of gradients to compute the individual learning rates. In contrast to existing methods, we make use of the most recent first moment of gradients to compute the individual learning rates per iteration. The motivation is that the dynamic variation of the first moment of gradients may provide useful information for obtaining the learning rates. We refer to the new method as the rapidly adapting moment estimation (RAME). The theoretical convergence of deterministic RAME is studied by using an analysis similar to the one used in [1] for Adam. Experimental results for training a number of DNNs show promising performance of RAME w.r.t. the convergence speed and generalization performance compared to the stochastic heavy-ball (SHB) method, Adam, and RMSprop.
Submitted 24 February, 2019;
originally announced February 2019.
-
Asymptotic uncertainty quantification for communities in sparse planted bi-section models
Authors:
B. J. K. Kleijn,
J. van Waaij
Abstract:
Posterior distributions for community structure in sparse planted bi-section models are shown to achieve exact (resp. almost-exact) recovery, with sharp bounds for the sparsity regimes where edge probabilities decrease as $O(\log(n)/n)$ (resp. $O(1/n)$). Assuming posterior recovery, one may interpret credible sets (resp. enlarged credible sets) as asymptotically consistent confidence sets; the diameters of those credible sets are controlled by the rate of posterior concentration. If credible levels are chosen to grow to one quickly enough, corresponding credible sets can be interpreted as frequentist confidence sets without conditions on posterior concentration. In the regimes with $O(1/n)$ edge sparsity, or when within-community and between-community edge probabilities are very close, credible sets may be enlarged to achieve frequentist asymptotic coverage, also without conditions on posterior concentration.
Submitted 2 March, 2023; v1 submitted 22 October, 2018;
originally announced October 2018.
-
Kernel Density Estimation-Based Markov Models with Hidden State
Authors:
Gustav Eje Henter,
Arne Leijon,
W. Bastiaan Kleijn
Abstract:
We consider Markov models of stochastic processes where the next-step conditional distribution is defined by a kernel density estimator (KDE), similar to Markov forecast densities and certain time-series bootstrap schemes. The KDE Markov models (KDE-MMs) we discuss are nonlinear, nonparametric, fully probabilistic representations of stationary processes, based on techniques with strong asymptotic consistency properties. The models generate new data by concatenating points from the training data sequences in a context-sensitive manner, together with some additive driving noise. We present novel EM-type maximum-likelihood algorithms for data-driven bandwidth selection in KDE-MMs. Additionally, we augment the KDE-MMs with a hidden state, yielding a new model class, KDE-HMMs. The added state variable captures non-Markovian long memory and signal structure (e.g., slow oscillations), complementing the short-range dependences described by the Markov process. The resulting joint Markov and hidden-Markov structure is appealing for modelling complex real-world processes such as speech signals. We present guaranteed-ascent EM-update equations for model parameters in the case of Gaussian kernels, as well as relaxed update formulas that greatly accelerate training in practice. Experiments demonstrate increased held-out set probability for KDE-HMMs on several challenging natural and synthetic data series, compared to traditional techniques such as autoregressive models, HMMs, and their combinations.
Submitted 30 July, 2018;
originally announced July 2018.
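The generation mechanism described in the abstract above (concatenating training points in a context-sensitive manner, plus additive driving noise) can be sketched for the simplest case. The sketch assumes a scalar series, a first-order context, Gaussian kernels, and a single shared bandwidth for context and output; none of these choices is taken from the paper, and the names `kde_mm_step` and `kde_mm_generate` are hypothetical. Under these assumptions, sampling from the KDE conditional density amounts to picking a training transition with probability proportional to the kernel weight of its context, then adding Gaussian noise to its successor.

```python
import math
import random

def kde_mm_step(x_prev, train, bandwidth, rng):
    """Draw the next value given x_prev from a first-order KDE Markov model."""
    n = len(train) - 1  # number of (context, successor) pairs in the training series
    # Log kernel weight of each training context train[i] relative to x_prev.
    logw = [-0.5 * ((x_prev - train[i]) / bandwidth) ** 2 for i in range(n)]
    top = max(logw)
    # Subtract the maximum before exponentiating so at least one weight equals 1.
    weights = [math.exp(lw - top) for lw in logw]
    i = rng.choices(range(n), weights=weights)[0]
    # Emit the successor of the chosen training point plus additive driving noise.
    return train[i + 1] + rng.gauss(0.0, bandwidth)

def kde_mm_generate(train, length, bandwidth, seed=0):
    """Generate `length` new samples, starting from the first training value."""
    rng = random.Random(seed)
    x = train[0]
    out = []
    for _ in range(length):
        x = kde_mm_step(x, train, bandwidth, rng)
        out.append(x)
    return out

samples = kde_mm_generate([0.0, 1.0, 0.0, 1.0, 0.0], length=6, bandwidth=0.01)
```

With the alternating training series and a small bandwidth, the generated samples alternate between values near 1 and values near 0, which shows how the model reuses training transitions selected by context.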
-
Bounded Information Rate Variational Autoencoders
Authors:
D. T. Braithwaite,
W. B. Kleijn
Abstract:
This paper introduces a new member of the family of Variational Autoencoders (VAE) that constrains the rate of information transferred by the latent layer. The latent layer is interpreted as a communication channel, the information rate of which is bounded by imposing a pre-set signal-to-noise ratio. The new constraint subsumes the mutual information between the input and latent variables, combining naturally with the likelihood objective of the observed data as used in a conventional VAE. The resulting Bounded-Information-Rate Variational Autoencoder (BIR-VAE) provides a meaningful latent representation with an information resolution that can be specified directly in bits by the system designer. The rate constraint can be used to prevent overtraining, and the method naturally facilitates quantisation of the latent variables at the set rate. Our experiments confirm that the BIR-VAE has a meaningful latent representation and that its performance is at least as good as state-of-the-art competing algorithms, but with lower computational complexity.
Submitted 25 July, 2018; v1 submitted 19 July, 2018;
originally announced July 2018.
-
Bregman Monotone Operator Splitting
Authors:
Kenta Niwa,
W. Bastiaan Kleijn
Abstract:
Monotone operator splitting is a powerful paradigm that facilitates parallel processing for optimization problems where the cost function can be split into two convex functions. We propose a generalized form of monotone operator splitting based on Bregman divergence. We show that an appropriate design of the Bregman divergence leads to faster convergence than conventional splitting algorithms. The proposed Bregman monotone operator splitting (B-MOS) is demonstrated on an application to illustrate its effectiveness. B-MOS was found to significantly improve the convergence rate.
Submitted 10 November, 2018; v1 submitted 12 July, 2018;
originally announced July 2018.
-
Directional emphasis in ambisonics
Authors:
W. Bastiaan Kleijn
Abstract:
We describe an ambisonics enhancement method that increases the signal strength in specified directions at low computational cost. The method can be used in a static setup to emphasize the signal arriving from a particular direction or set of directions. It can also be used in an adaptive arrangement where it sharpens directionality and reduces the distortion in timbre associated with low-degree ambisonics representations. The emphasis operator has very low computational complexity and can be applied to time-domain as well as time-frequency ambisonics representations. The operator upscales a low-degree ambisonics representation to a higher degree representation.
Submitted 24 May, 2018; v1 submitted 18 March, 2018;
originally announced March 2018.
-
Wavenet based low rate speech coding
Authors:
W. Bastiaan Kleijn,
Felicia S. C. Lim,
Alejandro Luebs,
Jan Skoglund,
Florian Stimberg,
Quan Wang,
Thomas C. Walters
Abstract:
Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet based coder and show that the system additionally performs implicit bandwidth extension and does not significantly impair recognition of the original speaker by human listeners, even when that speaker was not used during the training of the generative model.
Submitted 1 December, 2017;
originally announced December 2017.
-
Finite Synchrosqueezing Transform Based On The STFT
Authors:
Mozhgan Mohammadpour,
Bastiaan Kleijn,
Rajab Ali Kamyabi Gol
Abstract:
The finite STFT Synchrosqueezing transform is a time-frequency analysis method that can decompose finite complex signals into time-varying oscillatory components. This representation is sparse and invertible, allowing recovery of the original signal. The STFT Synchrosqueezing transform on finite dimensional signals has the advantage of an efficient matrix representation. This article defines the finite STFT Synchrosqueezing transform and describes some properties of this transform. We compare the finite STFT and the finite STFT Synchrosqueezing transform by applying these transforms to a set of signals.
Submitted 23 September, 2017;
originally announced September 2017.
-
On Relationship between Primal-Dual Method of Multipliers and Kalman Filter
Authors:
Guoqiang Zhang,
W. Bastiaan Kleijn,
Richard Heusdens
Abstract:
Recently the primal-dual method of multipliers (PDMM), a novel distributed optimization method, was proposed for solving a general class of decomposable convex optimizations over graphical models. In this work, we first study the convergence properties of PDMM for decomposable quadratic optimizations over tree-structured graphs. We show that with proper parameter selection, PDMM converges to its optimal solution in a finite number of iterations. We then apply PDMM to the causal estimation problem over a statistical linear state-space model. We show that PDMM and the Kalman filter have the same update expressions, where PDMM can be interpreted as solving a sequence of quadratic optimizations over a growing chain graph.
Submitted 22 August, 2017;
originally announced August 2017.
-
An evaluation of intrusive instrumental intelligibility metrics
Authors:
Steven Van Kuyk,
W. Bastiaan Kleijn,
Richard C. Hendriks
Abstract:
Instrumental intelligibility metrics are commonly used as an alternative to listening tests. This paper evaluates 12 monaural intrusive intelligibility metrics: SII, HEGP, CSII, HASPI, NCM, QSTI, STOI, ESTOI, MIKNN, SIMI, SIIB, and $\text{sEPSM}^\text{corr}$. In addition, this paper investigates the ability of intelligibility metrics to generalize to new types of distortions and analyzes why the top performing metrics have high performance. The intelligibility data were obtained from 11 listening tests described in the literature. The stimuli included Dutch, Danish, and English speech that was distorted by additive noise, reverberation, competing talkers, pre-processing enhancement, and post-processing enhancement. SIIB and HASPI had the highest performance, achieving average correlations with listening test scores of $\rho=0.92$ and $\rho=0.89$, respectively. The high performance of SIIB may, in part, be the result of SIIB's developers having access to all the intelligibility data considered in the evaluation. The results show that intelligibility metrics tend to perform poorly on data sets that were not used during their development. By modifying the original implementations of SIIB and STOI, the advantage of reducing statistical dependencies between input features is demonstrated. Additionally, the paper presents a new version of SIIB called $\text{SIIB}^\text{Gauss}$, which has similar performance to SIIB and HASPI, but takes less time to compute by two orders of magnitude.
Submitted 28 July, 2018; v1 submitted 20 August, 2017;
originally announced August 2017.
-
An instrumental intelligibility metric based on information theory
Authors:
Steven Van Kuyk,
W. Bastiaan Kleijn,
Richard C. Hendriks
Abstract:
We propose a monaural intrusive instrumental intelligibility metric called speech intelligibility in bits (SIIB). SIIB is an estimate of the amount of information shared between a talker and a listener in bits per second. Unlike existing information theoretic intelligibility metrics, SIIB accounts for talker variability and statistical dependencies between time-frequency units. Our evaluation shows that relative to state-of-the-art intelligibility metrics, SIIB is highly correlated with the intelligibility of speech that has been degraded by noise and processed by speech enhancement algorithms.
Submitted 14 January, 2018; v1 submitted 17 August, 2017;
originally announced August 2017.
-
Derivation and Analysis of the Primal-Dual Method of Multipliers Based on Monotone Operator Theory
Authors:
Thomas Sherson,
Richard Heusdens,
W. Bastiaan Kleijn
Abstract:
In this paper we present a novel derivation for an existing node-based algorithm for distributed optimisation termed the primal-dual method of multipliers (PDMM). In contrast to its initial derivation, in this work monotone operator theory is used to connect PDMM with other first-order methods such as Douglas-Rachford splitting and the alternating direction method of multipliers, thus providing insight into the operation of the scheme. In particular, we show how PDMM combines a lifted dual form with Peaceman-Rachford splitting to remove the need for collaboration between nodes per iteration. We demonstrate sufficient conditions for strong primal convergence for a general class of functions, while under the assumption of strong convexity and functional smoothness, we also introduce a primal geometric convergence bound. Finally, we introduce a distributed method of parameter selection in the geometric convergent case, requiring only finite transmissions to implement regardless of network topology.
Submitted 6 November, 2017; v1 submitted 8 June, 2017;
originally announced June 2017.
-
Cross-modal Subspace Learning for Fine-grained Sketch-based Image Retrieval
Authors:
Peng Xu,
Qiyue Yin,
Yongye Huang,
Yi-Zhe Song,
Zhanyu Ma,
Liang Wang,
Tao Xiang,
W. Bastiaan Kleijn,
Jun Guo
Abstract:
Sketch-based image retrieval (SBIR) is challenging due to the inherent domain gap between sketch and photo. Compared with the pixel-perfect depictions in photos, sketches are iconic renderings of the real world that are highly abstract. Therefore, matching sketch and photo directly using low-level visual cues is insufficient, since a common low-level subspace that traverses semantically across the two modalities is non-trivial to establish. Most existing SBIR studies do not directly tackle this cross-modal problem. This naturally motivates us to explore the effectiveness in SBIR of cross-modal retrieval methods, which have been applied successfully in image-text matching. In this paper, we introduce and compare a series of state-of-the-art cross-modal subspace learning methods and benchmark them on two recently released fine-grained SBIR datasets. Through thorough examination of the experimental results, we demonstrate that subspace learning can effectively model the sketch-photo domain gap. In addition, we draw a few key insights to drive future research.
Submitted 27 May, 2017;
originally announced May 2017.
-
Training Deep Neural Networks via Optimization Over Graphs
Authors:
Guoqiang Zhang,
W. Bastiaan Kleijn
Abstract:
In this work, we propose to train a deep neural network by distributed optimization over a graph. Two nonlinear functions are considered: the rectified linear unit (ReLU) and a linear unit with both lower and upper cutoffs (DCutLU). The problem reformulation over a graph is realized by explicitly representing ReLU or DCutLU using a set of slack variables. We then apply the alternating direction method of multipliers (ADMM) to update the weights of the network layerwise by solving subproblems of the reformulated problem. Empirical results suggest that the ADMM-based method is less sensitive to overfitting than the stochastic gradient descent (SGD) and Adam methods.
Submitted 17 June, 2017; v1 submitted 10 February, 2017;
originally announced February 2017.
-
On the frequentist validity of Bayesian limits
Authors:
B. J. K. Kleijn
Abstract:
To the frequentist who computes posteriors, not all priors are useful asymptotically: in this paper Schwartz's 1965 Kullback-Leibler condition is generalised to enable frequentist interpretation of convergence of posterior distributions with the complex models and often dependent datasets in present-day statistical applications. We prove four simple and fully general frequentist theorems, for posterior consistency; for posterior rates of convergence; for consistency of the Bayes factor in hypothesis testing or model selection; and a theorem to obtain confidence sets from credible sets. The latter has a significant methodological consequence in frequentist uncertainty quantification: use of a suitable prior allows one to convert credible sets of a calculated, simulated or approximated posterior into asymptotically consistent confidence sets, in full generality. This extends the main inferential implication of the Bernstein-von Mises theorem to non-parametric models without smoothness conditions. Proofs require the existence of a Bayesian type of test sequence and priors giving rise to local prior predictive distributions that satisfy a weakened form of Le Cam's contiguity with respect to the data distribution. Results are applied in a wide range of examples and counterexamples.
Submitted 27 November, 2017; v1 submitted 25 November, 2016;
originally announced November 2016.
-
The semi-parametric Bernstein-von Mises theorem for regression models with symmetric errors
Authors:
Minwoo Chae,
Yongdai Kim,
Bas Kleijn
Abstract:
In a smooth semi-parametric model, the marginal posterior distribution for a finite dimensional parameter of interest is expected to be asymptotically equivalent to the sampling distribution of any efficient point-estimator. The assertion leads to asymptotic equivalence of credible and confidence sets for the parameter of interest and is known as the semi-parametric Bernstein-von Mises theorem. In recent years, it has received much attention and has been applied in many examples. We consider models in which errors with symmetric densities play a role; more specifically, it is shown that the marginal posterior distributions of regression coefficients in the linear regression and linear mixed effect models satisfy the semi-parametric Bernstein-von Mises assertion. As a consequence, Bayes estimators in these models achieve frequentist inferential optimality, as expressed e.g. through Hajek's convolution and asymptotic minimax theorems. Conditions for the prior on the space of error densities are relatively mild and well-known constructions like the Dirichlet process mixture of normal densities and random series priors constitute valid choices. Particularly, the result provides an efficient estimate of regression coefficients in the linear mixed effect model, for which no other efficient point-estimator was known previously.
Submitted 12 January, 2017; v1 submitted 14 July, 2016;
originally announced July 2016.
-
Deep Reconstruction-Classification Networks for Unsupervised Domain Adaptation
Authors:
Muhammad Ghifary,
W. Bastiaan Kleijn,
Mengjie Zhang,
David Balduzzi,
Wen Li
Abstract:
In this paper, we propose a novel unsupervised domain adaptation algorithm based on deep learning for visual object recognition. Specifically, we design a new model called Deep Reconstruction-Classification Network (DRCN), which jointly learns a shared encoding representation for two tasks: i) supervised classification of labeled source data, and ii) unsupervised reconstruction of unlabeled target data. In this way, the learnt representation not only preserves discriminability, but also encodes useful information from the target domain. Our new DRCN model can be optimized by backpropagation in the same way as standard neural networks.
We evaluate the performance of DRCN on a series of cross-domain object recognition tasks, where DRCN provides a considerable improvement (up to ~8% in accuracy) over the prior state-of-the-art algorithms. Interestingly, we also observe that the reconstruction pipeline of DRCN transforms images from the source domain into images whose appearance resembles the target dataset. This suggests that DRCN's performance is due to constructing a single composite representation that encodes information about both the structure of target images and the classification of source images. Finally, we provide a formal analysis to justify the algorithm's objective in the domain adaptation context.
Submitted 1 August, 2016; v1 submitted 12 July, 2016;
originally announced July 2016.