A Survey on Time-Series Pre-Trained Models

Authors: Qianli Ma, Zhen Liu, Zhenjing Zheng, Ziyang Huang, Siying Zhu, Zhongzhong Yu, James T. Kwok

Abstract: Time-Series Mining (TSM) is an important research area since it shows great potential in practical applications. Deep learning models that rely on massive labeled data have been utilized for TSM successfully. However, constructing a large-scale well-labeled dataset is difficult due to data annotation costs. Recently, pre-trained models have gradually attracted attention in the time series domain due to their remarkable performance in computer vision and natural language processing. In this survey, we provide a comprehensive review of Time-Series Pre-Trained Models (TS-PTMs), aiming to guide the understanding, applying, and studying TS-PTMs. Specifically, we first briefly introduce the typical deep learning models employed in TSM. Then, we give an overview of TS-PTMs according to the pre-training techniques. The main categories we explore include supervised, unsupervised, and self-supervised TS-PTMs. Further, extensive experiments involving 27 methods, 434 datasets, and 679 transfer learning scenarios are conducted to analyze the advantages and disadvantages of transfer learning strategies, Transformer-based models, and representative TS-PTMs. Finally, we point out some potential directions of TS-PTMs for future work.

Submitted 4 October, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted in the IEEE Transactions on Knowledge and Data Engineering (TKDE)

arXiv:2303.18049 [pdf, other]

No Place to Hide: Dual Deep Interaction Channel Network for Fake News Detection based on Data Augmentation

Authors: Biwei Cao, Lulu Hua, Jiuxin Cao, Jie Gui, Bo Liu, James Tin-Yau Kwok

Abstract: Online Social Network (OSN) has become a hotbed of fake news due to the low cost of information dissemination. Although the existing methods have made many attempts in news content and propagation structure, the detection of fake news is still facing two challenges: one is how to mine the unique key features and evolution patterns, and the other is how to tackle the problem of small samples to build the high-performance model. Different from popular methods which take full advantage of the propagation topology structure, in this paper, we propose a novel framework for fake news detection from perspectives of semantic, emotion and data enhancement, which excavates the emotional evolution patterns of news participants during the propagation process, and a dual deep interaction channel network of semantic and emotion is designed to obtain a more comprehensive and fine-grained news representation with the consideration of comments. Meanwhile, the framework introduces a data enhancement module to obtain more labeled data with high quality based on confidence which further improves the performance of the classification model. Experiments show that the proposed approach outperforms the state-of-the-art methods.

Submitted 31 March, 2023; originally announced March 2023.

arXiv:2303.17255 [pdf, other]

Fooling the Image Dehazing Models by First Order Gradient

Authors: Jie Gui, Xiaofeng Cong, Chengwei Peng, Yuan Yan Tang, James Tin-Yau Kwok

Abstract: The research on the single image dehazing task has been widely explored. However, as far as we know, no comprehensive study has been conducted on the robustness of the well-trained dehazing models. Therefore, there is no evidence that the dehazing networks can resist malicious attacks. In this paper, we focus on designing a group of attack methods based on first order gradient to verify the robustness of the existing dehazing algorithms. By analyzing the general purpose of image dehazing task, four attack methods are proposed, which are predicted dehazed image attack, hazy layer mask attack, haze-free image attack and haze-preserved attack. The corresponding experiments are conducted on six datasets with different scales. Further, the defense strategy based on adversarial training is adopted for reducing the negative effects caused by malicious attacks. In summary, this paper defines a new challenging problem for the image dehazing area, which can be called as adversarial attack on dehazing networks (AADN). Code and Supplementary Material are available at https://github.com/Xiaofeng-life/AADN Dehazing.

Submitted 15 February, 2024; v1 submitted 30 March, 2023; originally announced March 2023.

Comments: This paper is accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

arXiv:2303.02405 [pdf, other]

Decision Support System for Chronic Diseases Based on Drug-Drug Interactions

Authors: Tian Bian, Yuli Jiang, Jia Li, Tingyang Xu, Yu Rong, Yi Su, Timothy Kwok, Helen Meng, Hong Cheng

Abstract: Many patients with chronic diseases resort to multiple medications to relieve various symptoms, which raises concerns about the safety of multiple medication use, as severe drug-drug antagonism can lead to serious adverse effects or even death. This paper presents a Decision Support System, called DSSDDI, based on drug-drug interactions to support doctors prescribing decisions. DSSDDI contains three modules, Drug-Drug Interaction (DDI) module, Medical Decision (MD) module and Medical Support (MS) module. The DDI module learns safer and more effective drug representations from the drug-drug interactions. To capture the potential causal relationship between DDI and medication use, the MD module considers the representations of patients and drugs as context, DDI and patients' similarity as treatment, and medication use as outcome to construct counterfactual links for the representation learning. Furthermore, the MS module provides drug candidates to doctors with explanations. Experiments on the chronic data collected from the Hong Kong Chronic Disease Study Project and a public diagnostic data MIMIC-III demonstrate that DSSDDI can be a reliable reference for doctors in terms of safety and efficiency of clinical diagnosis, with significant improvements compared to baseline methods.

Submitted 4 March, 2023; originally announced March 2023.

Journal ref: ICDE2023

arXiv:2301.05177 [pdf, other]

Searching for Heavy Neutral Leptons at A Future Muon Collider

Authors: Tsz Hong Kwok, Lingfeng Li, Tao Liu, Ariel Rock

Abstract: As the planning stages for a high energy muon collider enter a more concrete era, an important question arises as to what new physics could be uncovered. A TeV-scale muon collider is also a vector boson fusion (VBF) factory with a very clean background, and as such it is a promising environment to look for new physics that couples to the electroweak (EW) sector. In this paper, we explore the ability of a future TeV-scale muon collider to search for Majorana and Dirac Heavy Neutral Leptons (HNLs) produced via EW bosons. Employing a model-independent, conservative approach, we present an estimation of the production and decay rate of HNLs over a mass range between 200 GeV and 9.5 TeV in two benchmark collider proposals with $\sqrt{s}=3,\,10$ TeV, as well as an estimation of the dominant Standard Model (SM) background. We find that exclusion limits for the mixing between the HNLs and SM neutrinos can be as low as $\mathcal{O}(10^{-6})$. Additionally, we demonstrate that a TeV-scale muon collider allows for the ability to discriminate between Majorana and Dirac type HNLs for a large range of mixing values.

Submitted 19 January, 2023; v1 submitted 12 January, 2023; originally announced January 2023.

Comments: 26 pages, 13 figures, 2 tables. v2: references updated, typos fixed

arXiv:2301.03041 [pdf, other]

doi 10.1109/TIP.2023.3276708

Learning the Relation between Similarity Loss and Clustering Loss in Self-Supervised Learning

Authors: Jidong Ge, Yuxiang Liu, Jie Gui, Lanting Fang, Ming Lin, James Tin-Yau Kwok, LiGuo Huang, Bin Luo

Abstract: Self-supervised learning enables networks to learn discriminative features from massive data itself. Most state-of-the-art methods maximize the similarity between two augmentations of one image based on contrastive learning. By utilizing the consistency of two augmentations, the burden of manual annotations can be freed. Contrastive learning exploits instance-level information to learn robust features. However, the learned information is probably confined to different views of the same instance. In this paper, we attempt to leverage the similarity between two distinct images to boost representation in self-supervised learning. In contrast to instance-level information, the similarity between two distinct images may provide more useful information. Besides, we analyze the relation between similarity loss and feature-level cross-entropy loss. These two losses are essential for most deep learning methods. However, the relation between these two losses is not clear. Similarity loss helps obtain instance-level representation, while feature-level cross-entropy loss helps mine the similarity between two distinct images. We provide theoretical analyses and experiments to show that a suitable combination of these two losses can get state-of-the-art results. Code is available at https://github.com/guijiejie/ICCL.

Submitted 5 June, 2023; v1 submitted 8 January, 2023; originally announced January 2023.

Comments: This paper is accepted by IEEE Transactions on Image Processing

arXiv:2301.01190 [pdf, other]

Carbon in solution and the Charpy impact performance of medium Mn steels

Authors: TWK Kwok, FF Worsnop, JO Douglas, D Dye

Abstract: Carbon is a well known austenite stabiliser and can be used to alter the stacking fault energy and stability against martensitic transformation in medium Mn steels, producing a range of deformation mechanisms such as the Transformation Induced Plasticity (TRIP) or combined Twinning and Transformation Induced Plasticity (TWIP $+$ TRIP) effects. However, the effect of C beyond quasi-static tensile behaviour is less well known. Therefore, two medium Mn steels with 0.2 wt% and 0.5 wt% C were designed to produce similar austenite fractions and stability and therefore tensile behaviour. These were processed to form lamellar and mixed equiaxed $+$ lamellar microstructures. The low C steel had a corrected Charpy impact energy (KV$_{10}$) of 320 J cm$^{-2}$ compared to 66 J cm$^{-2}$ in the high C steel despite both having a ductility of over 35%. Interface segregation, e.g. of tramp elements, was investigated as a potential cause and none was found. Only a small amount of Mn rejection from partitioning was observed at the interface. The fracture surfaces were investigated and the TRIP effect was found to occur more readily in the Low C Charpy specimen. Therefore it is concluded that the use of C to promote TWIP$+$TRIP behaviour should be avoided in alloy design but the Charpy impact performance can be understood purely in terms of C in solution.

Submitted 3 January, 2023; originally announced January 2023.

arXiv:2212.02433 [pdf, other]

Testing Lepton Flavor Universality at Future $Z$ Factories

Authors: Tin Seng Manfred Ho, Xu-Hui Jiang, Tsz Hong Kwok, Lingfeng Li, Tao Liu

Abstract: As one of the hypothetical principles in the Standard Model (SM), lepton flavor universality (LFU) should be tested with a precision as high as possible such that the physics violating this principle can be fully examined. The run of $Z$ factory at a future $e^+e^-$ collider such as CEPC or FCC-$ee$ provides a great opportunity to perform this task because of the large statistics and high reconstruction efficiencies for $b$-hadrons at $Z$ pole. In this paper, we present a systematic study on the LFU test in the future $Z$ factories. The goal is three-fold. Firstly, we study the sensitivities of measuring the LFU-violating observables of $b\to c τν$, i.e., $R_{J/ψ}$, $R_{D_s}$, $R_{D_s^\ast}$ and $R_{Λ_c}$, where $τ$ decays muonically. For this purpose, we develop the strategies for event reconstruction, based on the track information significantly. Secondly, we explore the sensitivity robustness against detector performance and its potential improvement with the message of event shape or beyond the $b$-hadron decays. A picture is drawn on the variation of analysis sensitivities with the detector tracking resolution and soft photon detectability, and the impact of Fox-Wolfram moments is studied on the measurement of relevant flavor events. Finally, we interpret the projected sensitivities in the SM effective field theory, by combining the LFU tests of $b\to c τν$ and the measurements of $b\to s τ^+τ^-$ and $b\to s \barν ν$. We show that the limits on the LFU-violating energy scale can be pushed up to $\sim \mathcal{O} (10)$ TeV for $\lesssim \mathcal O(1)$ Wilson coefficients at Tera-$Z$.

Submitted 5 December, 2022; originally announced December 2022.

Comments: 63 pages, 27 figures

arXiv:2211.15362 [pdf, other]

Exploring the Coordination of Frequency and Attention in Masked Image Modeling

Authors: Jie Gui, Tuo Chen, Minjing Dong, Zhengqi Liu, Hao Luo, James Tin-Yau Kwok, Yuan Yan Tang

Abstract: Recently, masked image modeling (MIM), which learns visual representations by reconstructing the masked patches of an image, has dominated self-supervised learning in computer vision. However, the pre-training of MIM always takes massive time due to the large-scale data and large-size backbones. We mainly attribute it to the random patch masking in previous MIM works, which fails to leverage the crucial semantic information for effective visual representation learning. To tackle this issue, we propose the Frequency & Attention-driven Masking and Throwing Strategy (FAMT), which can extract semantic patches and reduce the number of training patches to boost model performance and training efficiency simultaneously. Specifically, FAMT utilizes the self-attention mechanism to extract semantic information from the image for masking during training in an unsupervised manner. However, attention alone could sometimes focus on inappropriate areas regarding the semantic information. Thus, we are motivated to incorporate the information from the frequency domain into the self-attention mechanism to derive the sampling weights for masking, which captures semantic patches for visual representation learning. Furthermore, we introduce a patch throwing strategy based on the derived sampling weights to reduce the training cost. FAMT can be seamlessly integrated as a plug-and-play module and surpasses previous works, e.g. reducing the training phase time by nearly 50% and improving the linear probing accuracy of MAE by 1.3% to 3.9% across various datasets, including CIFAR-10/100, Tiny ImageNet, and ImageNet-1K. FAMT also demonstrates superior performance in downstream detection and segmentation tasks.

Submitted 28 September, 2024; v1 submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.08736 [pdf, other]

doi 10.1109/TMM.2022.3222118

AlignVE: Visual Entailment Recognition Based on Alignment Relations

Authors: Biwei Cao, Jiuxin Cao, Jie Gui, Jiayun Shen, Bo Liu, Lei He, Yuan Yan Tang, James Tin-Yau Kwok

Abstract: Visual entailment (VE) is to recognize whether the semantics of a hypothesis text can be inferred from the given premise image, which is one special task among recent emerged vision and language understanding tasks. Currently, most of the existing VE approaches are derived from the methods of visual question answering. They recognize visual entailment by quantifying the similarity between the hypothesis and premise in the content semantic features from multi modalities. Such approaches, however, ignore the VE's unique nature of relation inference between the premise and hypothesis. Therefore, in this paper, a new architecture called AlignVE is proposed to solve the visual entailment problem with a relation interaction method. It models the relation between the premise and hypothesis as an alignment matrix. Then it introduces a pooling operation to get feature vectors with a fixed size. Finally, it goes through the fully-connected layer and normalization layer to complete the classification. Experiments show that our alignment-based architecture reaches 72.45% accuracy on SNLI-VE dataset, outperforming previous content-based models under the same settings.

Submitted 16 November, 2022; originally announced November 2022.

Comments: This paper is accepted for publication as a REGULAR paper in the IEEE Transactions on Multimedia

arXiv:2210.17180 [pdf, other]

doi 10.1109/TCSVT.2024.3395463

Automated Dominative Subspace Mining for Efficient Neural Architecture Search

Authors: Yaofo Chen, Yong Guo, Daihai Liao, Fanbing Lv, Hengjie Song, James Tin-Yau Kwok, Mingkui Tan

Abstract: Neural Architecture Search (NAS) aims to automatically find effective architectures within a predefined search space. However, the search space is often extremely large. As a result, directly searching in such a large search space is non-trivial and also very time-consuming. To address the above issues, in each search step, we seek to limit the search space to a small but effective subspace to boost both the search performance and search efficiency. To this end, we propose a novel Neural Architecture Search method via Dominative Subspace Mining (DSM-NAS) that finds promising architectures in automatically mined subspaces. Specifically, we first perform a global search, i.e., dominative subspace mining, to find a good subspace from a set of candidates. Then, we perform a local search within the mined subspace to find effective architectures. More critically, we further boost search performance by taking well-designed/searched architectures to initialize candidate subspaces. Experimental results demonstrate that DSM-NAS not only reduces the search cost but also discovers better architectures than state-of-the-art methods in various benchmark search spaces.

Submitted 6 June, 2024; v1 submitted 31 October, 2022; originally announced October 2022.

Comments: Published in IEEE TCSVT

arXiv:2209.15522 [pdf, other]

The mechanism of twin thickening and the elastic strain state of TWIP steel nanotwins

Authors: T W J Kwok, T P McAuliffe, A K Ackerman, B H Savitzky, M Danaie, C Ophus, D Dye

Abstract: A Twinning Induced Plasticity (TWIP) steel with a nominal composition of Fe-16.4Mn-0.9C-0.5Si-0.05Nb-0.05V was deformed to an engineering strain of 6%. The strain around the deformation twins were mapped using the 4D-STEM technique. Strain mapping showed a large average elastic strain of approximately 6% in the directions parallel and perpendicular to the twinning direction. However, the large average strain comprised of several hot spots of even larger strains of up to 12%. These hot spots could be attributed to a high density of sessile Frank dislocations on the twin boundary and correspond to shear stresses of 1--1.5 GPa. The strain and therefore stress fields are significantly larger than other materials known to twin and are speculated to be responsible for the early thickness saturation of TWIP steel nanotwins. The ability to keep twins extremely thin helps improve grain fragmentation, i.e. the dynamic Hall-Petch effect, and underpins the large elongations and strain hardening rates in TWIP steels.

Submitted 30 September, 2022; originally announced September 2022.

arXiv:2209.13139 [pdf, other]

Searching a High-Performance Feature Extractor for Text Recognition Network

Authors: Hui Zhang, Quanming Yao, James T. Kwok, Xiang Bai

Abstract: Feature extractor plays a critical role in text recognition (TR), but customizing its architecture is relatively less explored due to expensive manual tweaking. In this work, inspired by the success of neural architecture search (NAS), we propose to search for suitable feature extractors. We design a domain-specific search space by exploring principles for having good feature extractors. The space includes a 3D-structured space for the spatial model and a transformed-based space for the sequential model. As the space is huge and complexly structured, no existing NAS algorithms can be applied. We propose a two-stage algorithm to effectively search in the space. In the first stage, we cut the space into several blocks and progressively train each block with the help of an auxiliary head. We introduce the latency constraint into the second stage and search sub-network from the trained supernet via natural gradient descent. In experiments, a series of ablation studies are performed to better understand the designed space, search algorithm, and searched architectures. We also compare the proposed method with various state-of-the-art ones on both hand-written and scene TR tasks. Extensive results show that our approach can achieve better recognition performance with less latency.

Submitted 26 September, 2022; originally announced September 2022.

arXiv:2207.14443 [pdf, other]

A Survey of Learning on Small Data: Generalization, Optimization, and Challenge

Authors: Xiaofeng Cao, Weixin Bu, Shengjun Huang, Minling Zhang, Ivor W. Tsang, Yew Soon Ong, James T. Kwok

Abstract: Learning on big data brings success for artificial intelligence (AI), but the annotation and training costs are expensive. In future, learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI, which requires machines to recognize objectives and scenarios relying on small data as humans. A series of learning topics is going on this way such as active learning and few-shot learning. However, there are few theoretical guarantees for their generalization performance. Moreover, most of their settings are passive, that is, the label distribution is explicitly controlled by finite training resources from known distributions. This survey follows the agnostic active sampling theory under a PAC (Probably Approximately Correct) framework to analyze the generalization error and label complexity of learning on small data in model-agnostic supervised and unsupervised fashion. Considering multiple learning communities could produce small data representation and related topics have been well surveyed, we thus subjoin novel geometric representation perspectives for small data: the Euclidean and non-Euclidean (hyperbolic) mean, where the optimization solutions including the Euclidean gradients, non-Euclidean gradients, and Stein gradient are presented and discussed. Later, multiple learning communities that may be improved by learning on small data are summarized, which yield data-efficient representations, such as transfer learning, contrastive learning, graph representation learning. Meanwhile, we find that the meta-learning may provide effective parameter update policies for learning on small data. Then, we explore multiple challenging scenarios for small data, such as the weak supervision and multi-label. Finally, multiple data applications that may benefit from efficient small data representation are surveyed.

Submitted 6 June, 2023; v1 submitted 28 July, 2022; originally announced July 2022.

arXiv:2206.15205 [pdf, other]

Black-box Generalization of Machine Teaching

Authors: Xiaofeng Cao, Yaming Guo, Ivor W. Tsang, James T. Kwok

Abstract: Hypothesis-pruning maximizes the hypothesis updates for active learning to find those desired unlabeled data. An inherent assumption is that this learning manner can derive those updates into the optimal hypothesis. However, its convergence may not be guaranteed well if those incremental updates are negative and disordered. In this paper, we introduce a black-box teaching hypothesis $h^\mathcal{T}$ employing a tighter slack term $\left(1+\mathcal{F}^{\mathcal{T}}(\widehat{h}_t)\right)Δ_t$ to replace the typical $2Δ_t$ for pruning. Theoretically, we prove that, under the guidance of this teaching hypothesis, the learner can converge into a tighter generalization error and label complexity bound than those non-educated learners who do not receive any guidance from a teacher: 1) the generalization error upper bound can be reduced from $R(h^*)+4Δ_{T-1}$ to approximately $R(h^{\mathcal{T}})+2Δ_{T-1}$, and 2) the label complexity upper bound can be decreased from $4 θ\left(TR(h^{*})+2O(\sqrt{T})\right)$ to approximately $2θ\left(2TR(h^{\mathcal{T}})+3 O(\sqrt{T})\right)$. To be strict with our assumption, self-improvement of teaching is firstly proposed when $h^\mathcal{T}$ loosely approximates $h^*$. Against learning, we further consider two teaching scenarios: teaching a white-box and black-box learner. Experiments verify this idea and show better generalization performance than the fundamental active learning strategies, such as IWAL, IWAL-D, etc.

Submitted 20 September, 2023; v1 submitted 30 June, 2022; originally announced June 2022.

arXiv:2204.05388 [pdf, other]

The relative contributions of TWIP and TRIP to strength in fine grained medium-Mn steels

Authors: T W J Kwok, P Gong, R Rose, D Dye

Abstract: A medium Mn steel of composition Fe-4.8Mn-2.8Al-1.5Si-0.51C (wt.%) was processed to obtain two different microstructures representing two different approaches in the hot rolling mill, resulting in equiaxed vs. a mixed equiaxed and lamellar microstructures. Both were found to exhibit a simultaneous TWIP$+$TRIP plasticity enhancing mechanism where deformation twins and $α'$-martensite formed independently of twinning with strain. Interrupted tensile tests were conducted in order to investigate the differences in deformation structures between the two microstructures. A constitutive model was used to find that, surprisingly, twinning contributed relatively little to the strength of the alloy, chiefly due to the fine initial slip lengths that then gave rise to relatively little opportunity for work hardening by grain subdivision. Nevertheless, with lower high-cost alloying additions than equivalent Dual Phase steels (2-3 wt% Mn) and greater ductility, medium-Mn TWIP$+$TRIP steels still represent an attractive area for future development.

Submitted 19 August, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

Comments: Edited after review, second round

arXiv:2203.06168 [pdf, other]

Cheeger Inequalities for Vertex Expansion and Reweighted Eigenvalues

Authors: Tsz Chiu Kwok, Lap Chi Lau, Kam Chuen Tung

Abstract: The classical Cheeger's inequality relates the edge conductance $φ$ of a graph and the second smallest eigenvalue $λ_2$ of the Laplacian matrix. Recently, Olesker-Taylor and Zanetti discovered a Cheeger-type inequality $ψ^2 / \log |V| \lesssim λ_2^* \lesssim ψ$ connecting the vertex expansion $ψ$ of a graph $G=(V,E)$ and the maximum reweighted second smallest eigenvalue $λ_2^*$ of the Laplacian matrix. In this work, we first improve their result to $ψ^2 / \log d \lesssim λ_2^* \lesssim ψ$ where $d$ is the maximum degree in $G$, which is optimal assuming the small-set expansion conjecture. Also, the improved result holds for weighted vertex expansion, answering an open question by Olesker-Taylor and Zanetti. Building on this connection, we then develop a new spectral theory for vertex expansion. We discover that several interesting generalizations of Cheeger inequalities relating edge conductances and eigenvalues have a close analog in relating vertex expansions and reweighted eigenvalues. These include an analog of Trevisan's result on bipartiteness, an analog of higher order Cheeger's inequality, and an analog of improved Cheeger's inequality. Finally, inspired by this connection, we present negative evidence to the $0/1$-polytope edge expansion conjecture by Mihail and Vazirani. We construct $0/1$-polytopes whose graphs have very poor vertex expansion. This implies that the fastest mixing time to the uniform distribution on the vertices of these $0/1$-polytopes is almost linear in the graph size. This does not provide a counterexample to the conjecture, but this is in contrast with known positive results which proved poly-logarithmic mixing time to the uniform distribution on the vertices of subclasses of $0/1$-polytopes.
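
For orientation, the classical inequality that this abstract generalizes is usually stated as follows. The choice of the normalized Laplacian and the volume-based definition of $φ$ are assumptions made here for illustration; the abstract itself does not spell out its normalization.

```latex
\[
  \phi(S) = \frac{|E(S, V \setminus S)|}{\operatorname{vol}(S)}, \qquad
  \phi(G) = \min_{S \subset V,\; 0 < \operatorname{vol}(S) \le \operatorname{vol}(V)/2} \phi(S),
\]
\[
  \frac{\lambda_2}{2} \;\le\; \phi(G) \;\le\; \sqrt{2\lambda_2}.
\]
```

Here $\operatorname{vol}(S)$ is the sum of the degrees of the vertices in $S$ and $\lambda_2$ is the second smallest eigenvalue of the normalized Laplacian. The vertex-expansion results in the abstract replace $φ$ by $ψ$ and $λ_2$ by the reweighted quantity $λ_2^*$.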

Submitted 19 September, 2022; v1 submitted 11 March, 2022; originally announced March 2022.

Comments: 65 pages, 1 figure. Minor changes

arXiv:2202.08625 [pdf, other]

Revisiting Over-smoothing in BERT from the Perspective of Graph

Authors: Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, James T. Kwok

Abstract: Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both vision and language fields. However, no existing work has delved deeper to further investigate the main cause of this phenomenon. In this work, we make the attempt to analyze the over-smoothing problem from the perspective of graph, where such problem was first discovered and explored. Intuitively, the self-attention matrix can be seen as a normalized adjacency matrix of a corresponding graph. Based on the above connection, we provide some theoretical analysis and find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models. Specifically, if the standard deviation of layer normalization is sufficiently large, the output of Transformer stacks will converge to a specific low-rank subspace and result in over-smoothing. To alleviate the over-smoothing problem, we consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse. Extensive experiment results on various data sets illustrate the effect of our fusion method.
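
The graph view in this abstract can be made concrete with a toy sketch. The code below is not the authors' analysis: it leaves out layer normalization, residual connections and feed-forward layers, and the logits and token features are made-up values. It only shows that a row-wise softmax produces a row-normalized (row-stochastic) matrix, and that repeatedly averaging token features with such a matrix pulls them together.

```python
import math

def softmax(row):
    """Numerically stable softmax of one row of attention logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(scores):
    """Row-wise softmax: each row is non-negative and sums to 1,
    like a row-normalized adjacency matrix of a fully connected graph."""
    return [softmax(row) for row in scores]

def propagate(A, X):
    """One smoothing step: every token becomes a weighted average of all tokens."""
    n, d = len(X), len(X[0])
    return [[sum(A[i][k] * X[k][j] for k in range(n)) for j in range(d)]
            for i in range(n)]

def spread(X):
    """Largest per-feature gap between any two tokens (0 means identical tokens)."""
    return max(max(col) - min(col) for col in zip(*X))

# Hypothetical attention logits and token features, chosen only for illustration.
scores = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.3], [-0.5, 0.2, 1.5]]
A = attention_matrix(scores)
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]

initial_spread = spread(X)
for _ in range(20):  # 20 stacked "attention-only" layers sharing the same A
    X = propagate(A, X)
final_spread = spread(X)
```

After the loop, `final_spread` is far smaller than `initial_spread`: the token representations have collapsed toward a common vector, which is the rank-one limit of over-smoothing in this simplified setting.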

Submitted 17 February, 2022; originally announced February 2022.

Comments: Accepted by ICLR 2022 (Spotlight)

arXiv:2112.12115 [pdf, other]

Tailoring the Deformation Behaviour of a Medium Mn Steel through Isothermal Intercritical Annealing

Authors: X. Xu, T. W. J. Kwok, P. Gong, D. Dye

Abstract: A novel concept of varying the strain hardening rate of a medium Mn steel with 8 wt\% Mn by varying the duration of the intercritical anneal after hot rolling was explored. It was found that the stability of the austenite phase showed an inverse square root relationship with intercritical annealing duration and that the maximum strain hardening rate showed a linear relationship with austenite stability. The change in austenite stability was attributed to continuous Mn enrichment with increasing intercritical annealing duration. Twinned martensite was also found to be the most likely product of the martensitic transformation during deformation.

Submitted 4 April, 2022; v1 submitted 22 December, 2021; originally announced December 2021.

Comments: Updated in response to reviewer comments

arXiv:2112.01172 [pdf, other]

Strengthening $κ$-carbide steels using residual dislocation content

Authors: T. W. J. Kwok, K. M. Rahman, V. A. Vorontsov, D. Dye

Abstract: A steel with nominal composition Fe-28Mn-8Al-1.0C in mass percent was hot rolled at two temperatures, 1100 \degree C and 850 \degree C and subsequently aged at 550 \degree C for 24 h. The lower temperature rolling resulted in a yield strength increment of 299 MPa while still retaining an elongation to failure of over 30\%. The large improvement in strength was attributed to an increase in residual dislocation density which was retained even after the ageing heat treatment. A homogeneous precipitation of $κ$-carbides in both samples also showed that the high residual dislocation density did not adversely affect precipitation kinetics. These findings demonstrate that the tensile properties of this class of steel can yet be improved by optimising hot rolling process parameters.

Submitted 2 December, 2021; originally announced December 2021.

arXiv:2110.12729 [pdf, other]

doi 10.1007/s11661-021-06533-w

A scale up study on chemical segregation and the effects on tensile properties in two medium Mn steel castings

Authors: T. W. J. Kwok, C. Slater, X. Xu, C. Davis, D. Dye

Abstract: Two ingots weighing 400 g and 5 kg with nominal compositions of Fe-8Mn-4Al-2Si-0.5C-0.07V-0.05Sn were produced to investigate the effect of processing variables on microstructure development. The larger casting has a cooling rate more representative of commercial production and provides an understanding of the potential challenges arising from casting-related segregation during efforts to scale up medium Mn steels, whilst the smaller casting has a high cooling rate and different segregation pattern. Sections from both ingots were homogenised at 1250 \degree C for various times to study the degree of chemical homogeneity and $δ$-ferrite dissolution. Within 2 h, the Mn segregation range (max $-$ min) decreased from 8.0 to 1.7 wt\% in the 400 g ingot and from 6.2 to 1.5 wt\% in the 5 kg ingot. Some $δ$-ferrite also remained untransformed after 2 h in both ingots but with the 5 kg ingot showing nearly three times more than the 400 g ingot. Micress modelling was carried out and good agreement was seen between predicted and measured segregation levels and distribution. After thermomechanical processing, it was found that the coarse untransformed $δ$-ferrite in the 5 kg ingot turned into coarse $δ$-ferrite stringers in the finished product, resulting in a slight decrease in yield strength. Nevertheless, rolled strips from both ingots showed $>$900 MPa yield strength, $>$1100 MPa tensile strength and $>$40\% elongation with $<$10\% difference in strength and no change in ductility when compared to a fully homogenised sample.

Submitted 25 October, 2021; originally announced October 2021.

arXiv:2109.08342 [pdf, other]

Dropout's Dream Land: Generalization from Learned Simulators to Reality

Authors: Zac Wellmer, James T. Kwok

Abstract: A World Model is a generative model used to simulate an environment. World Models have proven capable of learning spatial and temporal representations of Reinforcement Learning environments. In some cases, a World Model offers an agent the opportunity to learn entirely inside of its own dream environment. In this work we explore improving the generalization capabilities from dream environments to real environments (Dream2Real). We present a general approach to improve a controller's ability to transfer from a neural network dream environment to reality at little additional cost. These improvements are gained by drawing on inspiration from Domain Randomization, where the basic idea is to randomize as much of a simulator as possible without fundamentally changing the task at hand. Generally, Domain Randomization assumes access to a pre-built simulator with configurable parameters but oftentimes this is not available. By training the World Model using dropout, the dream environment is capable of creating a nearly infinite number of different dream environments. Previous use cases of dropout either do not use dropout at inference time or average the predictions generated by multiple sampled masks (Monte-Carlo Dropout). Dropout's Dream Land leverages each unique mask to create a diverse set of dream environments. Our experimental results show that Dropout's Dream Land is an effective technique to bridge the reality gap between dream environments and reality. Furthermore, we perform an extensive set of ablation studies.

Submitted 16 September, 2021; originally announced September 2021.

Comments: Published at ECML PKDD 2021

arXiv:2107.00184 [pdf, other]

Bilinear Scoring Function Search for Knowledge Graph Learning

Authors: Yongqi Zhang, Quanming Yao, James Tin-Yau Kwok

Abstract: Learning embeddings for entities and relations in knowledge graph (KG) has benefited many downstream tasks. In recent years, scoring functions, the crux of KG learning, have been human-designed to measure the plausibility of triples and capture different kinds of relations in KGs. However, as relations exhibit intricate patterns that are hard to infer before training, none of them consistently performs the best on benchmark tasks. In this paper, inspired by the recent success of automated machine learning (AutoML), we search bilinear scoring functions for different KG tasks through the AutoML techniques. However, it is non-trivial to explore domain-specific information here. We first set up a search space for AutoBLM by analyzing existing scoring functions. Then, we propose a progressive algorithm (AutoBLM) and an evolutionary algorithm (AutoBLM+), which are further accelerated by filter and predictor to deal with the domain-specific properties for KG learning. Finally, we perform extensive experiments on benchmarks in KG completion, multi-hop query, and entity classification tasks. Empirical results show that the searched scoring functions are KG dependent, new to the literature, and outperform the existing scoring functions. AutoBLM+ is better than AutoBLM as the evolutionary algorithm can flexibly explore better structures in the same budget.

Submitted 4 March, 2022; v1 submitted 30 June, 2021; originally announced July 2021.

Comments: TPAMI accepted

arXiv:2106.06996 [pdf, other]

Pyramidal Dense Attention Networks for Lightweight Image Super-Resolution

Authors: Huapeng Wu, Jie Gui, Jun Zhang, James T. Kwok, Zhihui Wei

Abstract: Recently, deep convolutional neural network methods have achieved an excellent performance in image super-resolution (SR), but they cannot be easily applied to embedded devices due to large memory cost. To solve this problem, we propose a pyramidal dense attention network (PDAN) for lightweight image super-resolution in this paper. In our method, the proposed pyramidal dense learning can gradually increase the width of the densely connected layer inside a pyramidal dense block to extract deep features efficiently. Meanwhile, an adaptive group convolution, in which the number of groups grows linearly with dense convolutional layers, is introduced to relieve the parameter explosion. Besides, we also present a novel joint attention to capture cross-dimension interaction between the spatial dimensions and channel dimension in an efficient way for providing rich discriminative feature representations. Extensive experimental results show that our method achieves superior performance in comparison with the state-of-the-art lightweight SR methods.

Submitted 13 June, 2021; originally announced June 2021.

arXiv:2106.06966 [pdf, other]

Feedback Pyramid Attention Networks for Single Image Super-Resolution

Authors: Huapeng Wu, Jie Gui, Jun Zhang, James T. Kwok, Zhihui Wei

Abstract: Recently, convolutional neural network (CNN) based image super-resolution (SR) methods have achieved significant performance improvement. However, most CNN-based methods mainly focus on feed-forward architecture design and neglect to explore the feedback mechanism, which usually exists in the human visual system. In this paper, we propose feedback pyramid attention networks (FPAN) to fully exploit the mutual dependencies of features. Specifically, a novel feedback connection structure is developed to enhance low-level feature expression with high-level information. In our method, the output of each layer in the first stage is also used as the input of the corresponding layer in the next state to re-update the previous low-level filters. Moreover, we introduce a pyramid non-local structure to model global contextual information in different scales and improve the discriminative representation of the network. Extensive experimental results on various datasets demonstrate the superiority of our FPAN in comparison with the state-of-the-art SR methods.

Submitted 13 June, 2021; originally announced June 2021.

arXiv:2106.06326 [pdf, other]

TOHAN: A One-step Approach towards Few-shot Hypothesis Adaptation

Authors: Haoang Chi, Feng Liu, Wenjing Yang, Long Lan, Tongliang Liu, Bo Han, William K. Cheung, James T. Kwok

Abstract: In few-shot domain adaptation (FDA), classifiers for the target domain are trained with accessible labeled data in the source domain (SD) and few labeled data in the target domain (TD). However, data usually contain private information in the current era, e.g., data distributed on personal phones. Thus, the private information will be leaked if we directly access data in SD to train a target-domain classifier (required by FDA methods). In this paper, to thoroughly prevent the privacy leakage in SD, we consider a very challenging problem setting, where the classifier for the TD has to be trained using few labeled target data and a well-trained SD classifier, named few-shot hypothesis adaptation (FHA). In FHA, we cannot access data in SD, as a result, the private information in SD will be protected well. To this end, we propose a target orientated hypothesis adaptation network (TOHAN) to solve the FHA problem, where we generate highly-compatible unlabeled data (i.e., an intermediate domain) to help train a target-domain classifier. TOHAN maintains two deep networks simultaneously, where one focuses on learning an intermediate domain and the other takes care of the intermediate-to-target distributional adaptation and the target-risk minimization. Experimental results show that TOHAN outperforms competitive baselines significantly.

Submitted 7 September, 2022; v1 submitted 11 June, 2021; originally announced June 2021.

arXiv:2103.10782 [pdf, other]

doi 10.1007/s11661-021-06534-9

Microstructure evolution and tensile behaviour of a cold rolled 8 wt\% Mn medium manganese steel

Authors: Thomas WJ Kwok, Peng Gong, Xin Xu, John Nutter, W Mark Rainforth, David Dye

Abstract: A novel medium manganese steel named Novalloy with composition Fe-8.3Mn-3.8Al-1.8Si-0.5C-0.06V-0.05Sn was developed and thermomechanically processed through hot rolling and intercritical annealing. The steel possessed a yield strength of 1 GPa, tensile strength of 1.13 GPa and ductility of 41\%. In order to study the effect of cold rolling after intercritical annealing on subsequent tensile properties, the steel was further cold rolled up to 20\% reduction. After cold rolling, it was observed that the strain hardening rate increased continuously with increasing cold rolling reduction but without a significant drop in ductility during subsequent tensile tests. The microstructural evolution with cold rolling reduction was analysed to understand the mechanisms behind this enhanced TRIP effect. It was found that cold rolling activated additional twinning systems which provided a large number of potent nucleation sites for strain induced martensite to form during subsequent tensile tests.

Submitted 29 July, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

Comments: Updated in response to reviewer comments

Journal ref: Metall Mater Trans A, 2022

arXiv:2102.12871 [pdf, other]

SparseBERT: Rethinking the Importance Analysis in Self-attention

Authors: Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, James T. Kwok

Abstract: Transformer-based models are popularly used in natural language processing (NLP). Its core component, self-attention, has aroused widespread interest. To understand the self-attention mechanism, a direct method is to visualize the attention map of a pre-trained model. Based on the patterns observed, a series of efficient Transformers with different sparse attention masks have been proposed. From a theoretical perspective, universal approximability of Transformer-based models is also recently proved. However, the above understanding and analysis of self-attention is based on a pre-trained model. To rethink the importance analysis in self-attention, we study the significance of different positions in attention matrix during pre-training. A surprising result is that diagonal elements in the attention map are the least important compared with other attention positions. We provide a proof showing that these diagonal elements can indeed be removed without deteriorating model performance. Furthermore, we propose a Differentiable Attention Mask (DAM) algorithm, which further guides the design of the SparseBERT. Extensive experiments verify our interesting findings and illustrate the effect of the proposed algorithm.

Submitted 1 July, 2021; v1 submitted 25 February, 2021; originally announced February 2021.

Comments: Accepted by ICML 2021

arXiv:2011.04406 [pdf, other]

A Survey of Label-noise Representation Learning: Past, Present and Future

Authors: Bo Han, Quanming Yao, Tongliang Liu, Gang Niu, Ivor W. Tsang, James T. Kwok, Masashi Sugiyama

Abstract: Classical machine learning implicitly assumes that labels of the training data are sampled from a clean distribution, which can be too restrictive for real-world scenarios. However, statistical-learning-based methods may not train deep learning models robustly with these noisy labels. Therefore, it is urgent to design Label-Noise Representation Learning (LNRL) methods for robustly training deep models with noisy labels. To fully understand LNRL, we conduct a survey study. We first clarify a formal definition for LNRL from the perspective of machine learning. Then, via the lens of learning theory and empirical study, we figure out why noisy labels affect deep models' performance. Based on the theoretical guidance, we categorize different LNRL methods into three directions. Under this unified taxonomy, we provide a thorough discussion of the pros and cons of different categories. More importantly, we summarize the essential components of robust LNRL, which can spark new directions. Lastly, we propose possible research directions within LNRL, such as new datasets, instance-dependent LNRL, and adversarial LNRL. We also envision potential directions beyond LNRL, such as learning with feature-noise, preference-noise, domain-noise, similarity-noise, graph-noise and demonstration-noise.

Submitted 20 February, 2021; v1 submitted 9 November, 2020; originally announced November 2020.

Comments: The draft is kept updating; any comments and suggestions are welcome

arXiv:2008.06542 [pdf, other]

A Scalable, Adaptive and Sound Nonconvex Regularizer for Low-rank Matrix Completion

Authors: Yaqing Wang, Quanming Yao, James T. Kwok

Abstract: Matrix learning is at the core of many machine learning problems. A number of real-world applications such as collaborative filtering and text mining can be formulated as a low-rank matrix completion problem, which recovers an incomplete matrix using low-rank assumptions. To ensure that the matrix solution has a low rank, a recent trend is to use nonconvex regularizers that adaptively penalize singular values. They offer good recovery performance and have nice theoretical properties, but are computationally expensive due to repeated access to individual singular values. In this paper, based on the key insight that adaptive shrinkage on singular values improves empirical performance, we propose a new nonconvex low-rank regularizer called "nuclear norm minus Frobenius norm" regularizer, which is scalable, adaptive and sound. We first show it provably holds the adaptive shrinkage property. Further, we discover its factored form which bypasses the computation of singular values and allows fast optimization by general optimization algorithms. Stable recovery and convergence are guaranteed. Extensive low-rank matrix completion experiments on a number of synthetic and real-world data sets show that the proposed method obtains state-of-the-art recovery performance while being the fastest in comparison to existing low-rank matrix learning methods.
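
As a reading aid, the quantity named in this abstract can be written out directly from its name: the nuclear norm is the sum of the singular values and the Frobenius norm is the square root of the sum of their squares. The sketch below assumes the singular values are already available as a list, so it illustrates the definition only. It is not the factored form from the paper, which is designed to avoid computing singular values, and the function name is hypothetical.

```python
import math

def nuclear_minus_frobenius(singular_values):
    """Sum of singular values minus the root of their summed squares."""
    nuclear = sum(singular_values)
    frobenius = math.sqrt(sum(s * s for s in singular_values))
    return nuclear - frobenius

# Made-up singular value spectra for illustration.
rank_one = nuclear_minus_frobenius([3.0, 0.0, 0.0])  # 3 - 3 = 0.0
rank_two = nuclear_minus_frobenius([3.0, 4.0, 0.0])  # 7 - 5 = 2.0
```

The value is zero when at most one singular value is nonzero and positive otherwise, which is why penalizing it favours low-rank solutions.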

Submitted 22 February, 2021; v1 submitted 14 August, 2020; originally announced August 2020.

Comments: WebConf 2021

arXiv:2006.09117 [pdf, other]

End-to-End Real-time Catheter Segmentation with Optical Flow-Guided Warping during Endovascular Intervention

Authors: Anh Nguyen, Dennis Kundrat, Giulio Dagnino, Wenqiang Chi, Mohamed E. M. K. Abdelaziz, Yao Guo, YingLiang Ma, Trevor M. Y. Kwok, Celia Riga, Guang-Zhong Yang

Abstract: Accurate real-time catheter segmentation is an important pre-requisite for robot-assisted endovascular intervention. Most of the existing learning-based methods for catheter segmentation and tracking are only trained on small-scale datasets or synthetic data due to the difficulties of ground-truth annotation. Furthermore, the temporal continuity in intraoperative imaging sequences is not fully utilised. In this paper, we present FW-Net, an end-to-end and real-time deep learning framework for endovascular intervention. The proposed FW-Net has three modules: a segmentation network with encoder-decoder architecture, a flow network to extract optical flow information, and a novel flow-guided warping function to learn the frame-to-frame temporal continuity. We show that by effectively learning temporal continuity, the network can successfully segment and track the catheters in real-time sequences using only raw ground-truth for training. Detailed validation results confirm that our FW-Net outperforms state-of-the-art techniques while achieving real-time performance.

Submitted 16 June, 2020; originally announced June 2020.

Comments: ICRA 2020

arXiv:2004.03982 [pdf]

4D-STEM elastic stress state characterisation of a TWIP steel nanotwin

Authors: T P McAuliffe, A K Ackerman, B H Savitzky, T W J Kwok, M Danaie, C Ophus, D Dye

Abstract: We measure the stress state in and around a deformation nanotwin in a twinning-induced plasticity (TWIP) steel. Using four-dimensional scanning transmission electron microscopy (4D-STEM), we measure the elastic strain field in a 68.2-by-83.1 nm area of interest with a scan step of 0.36 nm and a diffraction limit resolution of 0.73 nm. The stress field in and surrounding the twin matches the form expected from analytical theory and is on the order of 15 GPa, close to the theoretical strength of the material. We infer that the measured back-stress limits twin thickening, providing a rationale for why TWIP steel twins remain thin during deformation, continuously dividing grains to give substantial work hardening. Our results support modern mechanistic understanding of the influence of twinning on crack propagation and embrittlement in TWIP steels.

Submitted 20 October, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

Comments: After peer review comments and resubmission

arXiv:1911.11322 [pdf, other]

Effective Decoding in Graph Auto-Encoder using Triadic Closure

Authors: Han Shi, Haozheng Fan, James T. Kwok

Abstract: The (variational) graph auto-encoder and its variants have been popularly used for representation learning on graph-structured data. While the encoder is often a powerful graph convolutional network, the decoder reconstructs the graph structure by only considering two nodes at a time, thus ignoring possible interactions among edges. On the other hand, structured prediction, which considers the whole graph simultaneously, is computationally expensive. In this paper, we utilize the well-known triadic closure property which is exhibited in many real-world networks. We propose the triad decoder, which considers and predicts the three edges involved in a local triad together. The triad decoder can be readily used in any graph-based auto-encoder. In particular, we incorporate this to the (variational) graph auto-encoder. Experiments on link prediction, node clustering and graph generation show that the use of triads leads to more accurate prediction, clustering and better preservation of the graph characteristics.

Submitted 25 November, 2019; originally announced November 2019.

Comments: Accepted by AAAI 2020

arXiv:1911.09336 [pdf, other]

Bridging the Gap between Sample-based and One-shot Neural Architecture Search with BONAS

Authors: Han Shi, Renjie Pi, Hang Xu, Zhenguo Li, James T. Kwok, Tong Zhang

Abstract: Neural Architecture Search (NAS) has shown great potentials in finding better neural network designs. Sample-based NAS is the most reliable approach which aims at exploring the search space and evaluating the most promising architectures. However, it is computationally very costly. As a remedy, the one-shot approach has emerged as a popular technique for accelerating NAS using weight-sharing. However, due to the weight-sharing of vastly different networks, the one-shot approach is less reliable than the sample-based approach. In this work, we propose BONAS (Bayesian Optimized Neural Architecture Search), a sample-based NAS framework which is accelerated using weight-sharing to evaluate multiple related architectures simultaneously. Specifically, we apply Graph Convolutional Network predictor as a surrogate model for Bayesian Optimization to select multiple related candidate models in each iteration. We then apply weight-sharing to train multiple candidate models simultaneously. This approach not only accelerates the traditional sample-based approach significantly, but also keeps its reliability. This is because weight-sharing among related architectures are more reliable than those in the one-shot approach. Extensive experiments are conducted to verify the effectiveness of our method over many competing algorithms.

Submitted 24 November, 2020; v1 submitted 21 November, 2019; originally announced November 2019.

Comments: Accepted by NeurIPS 2020

arXiv:1908.07258 [pdf, other]

doi 10.1016/j.msea.2020.139258

Design of a High Strength, High Ductility 12 wt% Mn Medium Manganese Steel With Hierarchical Deformation Behaviour

Authors: T W J Kwok, K M Rahman, X Xu, I Bantounas, J F Kelleher, S Daswari, T Alam, R Banerjee, D Dye

Abstract: A novel medium Mn steel of composition Fe-12Mn-4.8Al-2Si-0.32C-0.3V was manufactured with 1.09 GPa yield strength, 1.26 GPa tensile strength and 54% elongation. The thermomechanical process route was designed to be industrially translatable and consists of hot and then warm rolling before a 30 min intercritical anneal. The resulting microstructure comprised of coarse elongated austenite grains in the rolling direction surrounded by necklace layers of fine austenite and ferrite grains. The tensile behaviour was investigated by in-situ neutron diffraction and the evolution of microstructure studied with Electron Backscattered Diffraction (EBSD). It was found that the coarse austenite grains contributed to the first stage of strain hardening by transforming into martensite and the fine austenite necklace grains contributed to the second stage of strain hardening by a mixture of twinning and transformation induced plasticity (TWIP and TRIP) mechanisms. This hierarchical deformation behaviour contributed to the exceptional ductility of this steel.

Submitted 21 April, 2020; v1 submitted 20 August, 2019; originally announced August 2019.

Comments: Updated on resubmission, minor clarifications

Journal ref: Mater. Sci. Eng. A 782:139258, 2020

arXiv:1905.10936 [pdf, other]

Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback

Authors: Shuai Zheng, Ziyue Huang, James T. Kwok

Abstract: Communication overhead is a major bottleneck hampering the scalability of distributed machine learning systems. Recently, there has been a surge of interest in using gradient compression to improve the communication efficiency of distributed neural network training. Using 1-bit quantization, signSGD with majority vote achieves a 32x reduction on communication cost. However, its convergence is based on unrealistic assumptions and can diverge in practice. In this paper, we propose a general distributed compressed SGD with Nesterov's momentum. We consider two-way compression, which compresses the gradients both to and from workers. Convergence analysis on nonconvex problems for general gradient compressors is provided. By partitioning the gradient into blocks, a blockwise compressor is introduced such that each gradient block is compressed and transmitted in 1-bit format with a scaling factor, leading to a nearly 32x reduction on communication. Experimental results show that the proposed method converges as fast as full-precision distributed momentum SGD and achieves the same testing accuracy. In particular, on distributed ResNet training with 7 workers on the ImageNet, the proposed algorithm achieves the same testing accuracy as momentum SGD using full-precision gradients, but with $46\%$ less wall clock time.

Submitted 28 October, 2019; v1 submitted 26 May, 2019; originally announced May 2019.

Comments: NeurIPS 2019

arXiv:1905.09899 [pdf, other]

Blockwise Adaptivity: Faster Training and Better Generalization in Deep Learning

Authors: Shuai Zheng, James T. Kwok

Abstract: Stochastic methods with coordinate-wise adaptive stepsize (such as RMSprop and Adam) have been widely used in training deep neural networks. Despite their fast convergence, they can generalize worse than stochastic gradient descent. In this paper, by revisiting the design of Adagrad, we propose to split the network parameters into blocks, and use a blockwise adaptive stepsize. Intuitively, blockwise adaptivity is less aggressive than adaptivity to individual coordinates, and can have a better balance between adaptivity and generalization. We show theoretically that the proposed blockwise adaptive gradient descent has comparable convergence rate as its counterpart with coordinate-wise adaptive stepsize, but is faster up to some constant. We also study its uniform stability and show that blockwise adaptivity can lead to lower generalization error than coordinate-wise adaptivity. Experimental results show that blockwise adaptive gradient descent converges faster and improves generalization performance over Nesterov's accelerated gradient and Adam.

Submitted 23 May, 2019; originally announced May 2019.

arXiv:1904.03213 [pdf, ps, other]

Spectral analysis of matrix scaling and operator scaling

Authors: Tsz Chiu Kwok, Lap Chi Lau, Akshay Ramachandran

Abstract: We present a spectral analysis for matrix scaling and operator scaling. We prove that if the input matrix or operator has a spectral gap, then a natural gradient flow has linear convergence. This implies that a simple gradient descent algorithm also has linear convergence under the same assumption. The spectral gap condition for operator scaling is closely related to the notion of quantum expander studied in quantum information theory. The spectral analysis also provides bounds on some important quantities of the scaling problems, such as the condition number of the scaling solution and the capacity of the matrix and operator. These bounds can be used in various applications of scaling problems, including matrix scaling on expander graphs, permanent lower bounds on random matrices, the Paulsen problem on random frames, and Brascamp-Lieb constants on random operators. In some applications, the inputs of interest satisfy the spectral condition and we prove significantly stronger bounds than the worst case bounds.

Submitted 5 April, 2019; originally announced April 2019.

arXiv:1903.03253 [pdf, other]

doi 10.1109/TIP.2020.2980980

General Convolutional Sparse Coding with Unknown Noise

Authors: Yaqing Wang, James T. Kwok, Lionel M. Ni

Abstract: Convolutional sparse coding (CSC) can learn representative shift-invariant patterns from multiple kinds of data. However, existing CSC methods can only model noises from Gaussian distribution, which is restrictive and unrealistic. In this paper, we propose a general CSC model capable of dealing with complicated unknown noise. The noise is now modeled by Gaussian mixture model, which can approximate any continuous probability density function. We use the expectation-maximization algorithm to solve the problem and design an efficient method for the weighted CSC problem in maximization step. The crux is to speed up the convolution in the frequency domain while keeping the other computation involving weight matrix in the spatial domain. Besides, we simultaneously update the dictionary and codes by nonconvex accelerated proximal gradient algorithm without bringing in extra alternating loops. The resultant method obtains comparable time and space complexity compared with existing CSC methods. Extensive experiments on synthetic and real noisy biomedical data sets validate that our method can model noise effectively and obtain high-quality filters and representation.

Submitted 7 March, 2019; originally announced March 2019.

arXiv:1811.09491 [pdf, other]

Differential Private Stack Generalization with an Application to Diabetes Prediction

Authors: Quanming Yao, Xiawei Guo, James T. Kwok, WeiWei Tu, Yuqiang Chen, Wenyuan Dai, Qiang Yang

Abstract: To meet the standard of differential privacy, noise is usually added into the original data, which inevitably deteriorates the predicting performance of subsequent learning algorithms. In this paper, motivated by the success of improving predicting performance by ensemble learning, we propose to enhance privacy-preserving logistic regression by stacking. We show that this can be done either by sample-based or feature-based partitioning. However, we prove that when privacy-budgets are the same, feature-based partitioning requires fewer samples than sample-based one, and thus likely has better empirical performance. As transfer learning is difficult to be integrated with a differential privacy guarantee, we further combine the proposed method with hypothesis transfer learning to address the problem of learning across different organizations. Finally, we not only demonstrate the effectiveness of our method on two benchmark data sets, i.e., MNIST and NEWS20, but also apply it into a real application of cross-organizational diabetes prediction from RUIJIN data set, where privacy is of significant concern.

Submitted 2 June, 2019; v1 submitted 23 November, 2018; originally announced November 2018.

arXiv:1807.08725 [pdf, other]

FasTer: Fast Tensor Completion with Nonconvex Regularization

Authors: Quanming Yao, James T Kwok, Bo Han

Abstract: Low-rank tensor completion problem aims to recover a tensor from limited observations, which has many real-world applications. Due to the easy optimization, the convex overlapping nuclear norm has been popularly used for tensor completion. However, it over-penalizes top singular values and lead to biased estimations. In this paper, we propose to use the nonconvex regularizer, which can less penalize large singular values, instead of the convex one for tensor completion. However, as the new regularizer is nonconvex and overlapped with each other, existing algorithms are either too slow or suffer from the huge memory cost. To address these issues, we develop an efficient and scalable algorithm, which is based on the proximal average (PA) algorithm, for real-world problems. Compared with the direct usage of PA algorithm, the proposed algorithm runs orders faster and needs orders less space. We further speed up the proposed algorithm with the acceleration technique, and show the convergence to critical points is still guaranteed. Experimental comparisons of the proposed approach are made with various other tensor completion approaches. Empirical results show that the proposed algorithm is very fast and can produce much better recovery performance.

Submitted 23 January, 2019; v1 submitted 23 July, 2018; originally announced July 2018.

arXiv:1806.02927 [pdf, other]

Lightweight Stochastic Optimization for Minimizing Finite Sums with Infinite Data

Authors: Shuai Zheng, James T. Kwok

Abstract: Variance reduction has been commonly used in stochastic optimization. It relies crucially on the assumption that the data set is finite. However, when the data are imputed with random noise as in data augmentation, the perturbed data set becomes essentially infinite. Recently, the stochastic MISO (S-MISO) algorithm is introduced to address this expected risk minimization problem. Though it converges faster than SGD, a significant amount of memory is required. In this paper, we propose two SGD-like algorithms for expected risk minimization with random perturbation, namely, stochastic sample average gradient (SSAG) and stochastic SAGA (S-SAGA). The memory cost of SSAG does not depend on the sample size, while that of S-SAGA is the same as those of variance reduction methods on unperturbed data. Theoretical analysis and experimental results on logistic regression and AUC maximization show that SSAG has faster convergence rate than SGD with comparable space requirement, while S-SAGA outperforms S-MISO in terms of both iteration complexity and storage.

Submitted 7 June, 2018; originally announced June 2018.

Comments: To appear in ICML 2018

arXiv:1805.01891 [pdf, other]

Power Law in Sparsified Deep Neural Networks

Authors: Lu Hou, James T. Kwok

Abstract: The power law has been observed in the degree distributions of many biological neural networks. Sparse deep neural networks, which learn an economical representation from the data, resemble biological neural networks in many ways. In this paper, we study if these artificial networks also exhibit properties of the power law. Experimental results on two popular deep learning models, namely, multilayer perceptrons and convolutional neural networks, are affirmative. The power law is also naturally related to preferential attachment. To study the dynamical properties of deep networks in continual learning, we propose an internal preferential attachment model to explain how the network topology evolves. Experimental results show that with the arrival of a new task, the new connections made follow this preferential attachment process.

Submitted 4 May, 2018; originally announced May 2018.

arXiv:1804.10366 [pdf, other]

Online Convolutional Sparse Coding with Sample-Dependent Dictionary

Authors: Yaqing Wang, Quanming Yao, James T. Kwok, Lionel M. Ni

Abstract: Convolutional sparse coding (CSC) has been popularly used for the learning of shift-invariant dictionaries in image and signal processing. However, existing methods have limited scalability. In this paper, instead of convolving with a dictionary shared by all samples, we propose the use of a sample-dependent dictionary in which filters are obtained as linear combinations of a small set of base filters learned from the data. This added flexibility allows a large number of sample-dependent patterns to be captured, while the resultant model can still be efficiently learned by online learning. Extensive experimental results show that the proposed method outperforms existing CSC algorithms with significantly reduced time and space requirements.

Submitted 7 June, 2018; v1 submitted 27 April, 2018; originally announced April 2018.

Comments: Accepted by ICML-2018

arXiv:1802.08635 [pdf, other]

Loss-aware Weight Quantization of Deep Networks

Authors: Lu Hou, James T. Kwok

Abstract: The huge size of deep networks hinders their use in small computing devices. In this paper, we consider compressing the network by weight quantization. We extend a recently proposed loss-aware weight binarization scheme to ternarization, with possibly different scaling parameters for the positive and negative weights, and m-bit (where m > 2) quantization. Experiments on feedforward and recurrent neural networks show that the proposed scheme outperforms state-of-the-art weight quantization algorithms, and is as accurate (or even more accurate) than the full-precision network.

Submitted 10 May, 2018; v1 submitted 23 February, 2018; originally announced February 2018.

arXiv:1712.08451 [pdf]

doi 10.1088/1361-6552/aaccdb

Testing the validity of the Lorentz factor

Authors: H Broomfield, J Hirst, Theodoros Vafeiadis, Markus Joos, A Singh, M Raven, T K Chung, J Harrow, T Kwok, J Li, K Tsui, A Tsui, R Perkins, H Mandelstam, D Khoo, J Southwell, J Martin-Halls, D Townsend, H Watson

Abstract: Our proposed experiment aimed to test the validity of the Lorentz factor with two methods: The time of flight (TOF) of various particles at different momenta and the decay rate of pions at different momenta. Due to the high sensitivity required for the second method the results were inconclusive, therefore we report only on the results of the first method.

Submitted 19 December, 2017; originally announced December 2017.

Comments: The authors of the paper are winners of the CERN Beamline for Schools <https://voisins.cern/en/offre/bl4s> 2016 competition. To be submitted to the journal Physics Education

arXiv:1710.07205 [pdf, other]

Scalable Robust Matrix Factorization with Nonconvex Loss

Authors: Quanming Yao, James T. Kwok

Abstract: Robust matrix factorization (RMF), which uses the $\ell_1$-loss, often outperforms standard matrix factorization using the $\ell_2$-loss, particularly when outliers are present. The state-of-the-art RMF solver is the RMF-MM algorithm, which, however, cannot utilize data sparsity. Moreover, sometimes even the (convex) $\ell_1$-loss is not robust enough. In this paper, we propose the use of nonconvex loss to enhance robustness. To address the resultant difficult optimization problem, we use majorization-minimization (MM) optimization and propose a new MM surrogate. To improve scalability, we exploit data sparsity and optimize the surrogate via its dual with the accelerated proximal gradient algorithm. The resultant algorithm has low time and space complexities and is guaranteed to converge to a critical point. Extensive experiments demonstrate its superiority over the state-of-the-art in terms of both accuracy and scalability.

Submitted 23 September, 2018; v1 submitted 19 October, 2017; originally announced October 2017.

arXiv:1710.02587 [pdf, ps, other]

The Paulsen Problem, Continuous Operator Scaling, and Smoothed Analysis

Authors: Tsz Chiu Kwok, Lap Chi Lau, Yin Tat Lee, Akshay Ramachandran

Abstract: The Paulsen problem is a basic open problem in operator theory: Given vectors $u_1, \ldots, u_n \in \mathbb R^d$ that are $ε$-nearly satisfying the Parseval's condition and the equal norm condition, is it close to a set of vectors $v_1, \ldots, v_n \in \mathbb R^d$ that exactly satisfy the Parseval's condition and the equal norm condition? Given $u_1, \ldots, u_n$, the squared distance (to the set of exact solutions) is defined as $\inf_{v} \sum_{i=1}^n \| u_i - v_i \|_2^2$ where the infimum is over the set of exact solutions. Previous results show that the squared distance of any $ε$-nearly solution is at most $O({\rm{poly}}(d,n,ε))$ and there are $ε$-nearly solutions with squared distance at least $Ω(dε)$. The fundamental open question is whether the squared distance can be independent of the number of vectors $n$. We answer this question affirmatively by proving that the squared distance of any $ε$-nearly solution is $O(d^{13/2} ε)$. Our approach is based on a continuous version of the operator scaling algorithm and consists of two parts. First, we define a dynamical system based on operator scaling and use it to prove that the squared distance of any $ε$-nearly solution is $O(d^2 n ε)$. Then, we show that by randomly perturbing the input vectors, the dynamical system will converge faster and the squared distance of an $ε$-nearly solution is $O(d^{5/2} ε)$ when $n$ is large enough and $ε$ is small enough. To analyze the convergence of the dynamical system, we develop some new techniques in lower bounding the operator capacity, a concept introduced by Gurvits to analyze the operator scaling algorithm.

Submitted 8 November, 2017; v1 submitted 6 October, 2017; originally announced October 2017.

Comments: Added Subsection 1.4; Incorporated comments and fixed typos; Minor changes in various places

arXiv:1708.01265 [pdf, other]

doi 10.1088/1475-7516/2018/01/001

Seasonal Variation of the Underground Cosmic Muon Flux Observed at Daya Bay

Authors: F. P. An, A. B. Balantekin, H. R. Band, M. Bishai, S. Blyth, D. Cao, G. F. Cao, J. Cao, Y. L. Chan, J. F. Chang, Y. Chang, H. S. Chen, Q. Y. Chen, S. M. Chen, Y. X. Chen, Y. Chen, J. Cheng, Z. K. Cheng, J. J. Cherwinka, M. C. Chu, A. Chukanov, J. P. Cummings, Y. Y. Ding, M. V. Diwan, M. Dolgareva , et al. (179 additional authors not shown)

Abstract: The Daya Bay Experiment consists of eight identically designed detectors located in three underground experimental halls named as EH1, EH2, EH3, with 250, 265 and 860 meters of water equivalent vertical overburden, respectively. Cosmic muon events have been recorded over a two-year period. The underground muon rate is observed to be positively correlated with the effective atmospheric temperature and to follow a seasonal modulation pattern. The correlation coefficient $α$, describing how a variation in the muon rate relates to a variation in the effective atmospheric temperature, is found to be $α_{\text{EH1}} = 0.362\pm0.031$, $α_{\text{EH2}} = 0.433\pm0.038$ and $α_{\text{EH3}} = 0.641\pm0.057$ for each experimental hall.

Submitted 8 January, 2018; v1 submitted 3 August, 2017; originally announced August 2017.

Comments: Updated to be identical to the published version

Journal ref: JCAP01(2018)001

arXiv:1708.00146 [pdf, other]

Large-Scale Low-Rank Matrix Learning with Nonconvex Regularizers

Authors: Quanming Yao, James T. Kwok, Taifeng Wang, Tie-Yan Liu

Abstract: Low-rank modeling has many important applications in computer vision and machine learning. While the matrix rank is often approximated by the convex nuclear norm, the use of nonconvex low-rank regularizers has demonstrated better empirical performance. However, the resulting optimization problem is much more challenging. Recent state-of-the-art requires an expensive full SVD in each iteration. In this paper, we show that for many commonly-used nonconvex low-rank regularizers, a cutoff can be derived to automatically threshold the singular values obtained from the proximal operator. This allows such operator being efficiently approximated by power method. Based on it, we develop a proximal gradient algorithm (and its accelerated variant) with inexact proximal splitting and prove that a convergence rate of O(1/T) where T is the number of iterations is guaranteed. Furthermore, we show the proposed algorithm can be well parallelized, which achieves nearly linear speedup w.r.t the number of threads. Extensive experiments are performed on matrix completion and robust principal component analysis, which shows a significant speedup over the state-of-the-art. Moreover, the matrix solution obtained is more accurate and has a lower rank than that of the nuclear norm regularizer.

Submitted 23 July, 2018; v1 submitted 31 July, 2017; originally announced August 2017.

Comments: Accepted by TPAMI in 2018 (extension of ICDM-2015 conference paper arXiv:1512.00984)

Showing 51–100 of 141 results for author: Kwok, T