Search | arXiv e-print repository

arXiv:2507.19672 [pdf, ps, other]

Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges

Authors: Haoran Lu, Luyang Fang, Ruidong Zhang, Xinliang Li, Jiazhang Cai, Huimin Cheng, Lin Tang, Ziyu Liu, Zeliang Sun, Tao Wang, Yingchuan Zhang, Arif Hassan Zidan, Jinwen Xu, Jincheng Yu, Meizhi Yu, Hanqi Jiang, Xilin Gong, Weidi Luo, Bolun Sun, Yongkai Chen, Terry Ma, Shushan Wu, Yifan Zhou, Junhao Chen, Haotian Xiang, et al. (25 additional authors not shown)

Abstract: Due to the remarkable capabilities and growing impact of large language models (LLMs), they have been deeply integrated into many aspects of society. Thus, ensuring their alignment with human values and intentions has emerged as a critical challenge. This survey provides a comprehensive overview of practical alignment techniques, training protocols, and empirical findings in LLM alignment. We analyze the development of alignment methods across diverse paradigms, characterizing the fundamental trade-offs between core alignment objectives. Our analysis shows that while supervised fine-tuning enables basic instruction-following, preference-based methods offer more flexibility for aligning with nuanced human intent. We discuss state-of-the-art techniques, including Direct Preference Optimization (DPO), Constitutional AI, brain-inspired methods, and alignment uncertainty quantification (AUQ), highlighting their approaches to balancing quality and efficiency. We review existing evaluation frameworks and benchmarking datasets, emphasizing limitations such as reward misspecification, distributional robustness, and scalable oversight. We summarize strategies adopted by leading AI labs to illustrate the current state of practice. We conclude by outlining open problems in oversight, value pluralism, robustness, and continuous alignment. This survey aims to inform both researchers and practitioners navigating the evolving landscape of LLM alignment.

Submitted 25 July, 2025; originally announced July 2025.

Comments: 119 pages, 10 figures, 7 tables

arXiv:2506.03347 [pdf, ps, other]

Constructing g-computation estimators: two case studies in selection bias

Authors: Paul N Zivich, Haidong Lu

Abstract: G-computation is a useful estimation method that can be adapted to address various biases in epidemiology. However, these adaptations may not be obvious for some complex causal structures. This challenge is an example of the much wider issue of translating a causal diagram into a novel estimation strategy. To highlight these challenges, we consider two recent cases from the selection bias literature: treatment-induced selection and co-occurrence of biases that lack a joint adjustment set. For each case study, we show how g-computation can be adapted, describe how to implement that adaptation, show some general statistical properties, and illustrate the estimator using simulation. To simplify both the theoretical study and practical application of our estimators, we express the proposed g-computation estimators as stacked estimating equations. These examples illustrate how epidemiologists can translate identification results into a g-computation estimator and study the theoretical and finite-sample properties of a novel estimator.

Submitted 31 July, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

arXiv:2504.14772 [pdf, other]

Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Authors: Luyang Fang, Xiaowei Yu, Jiazhang Cai, Yongkai Chen, Shushan Wu, Zhengliang Liu, Zhenyuan Yang, Haoran Lu, Xilin Gong, Yufang Liu, Terry Ma, Wei Ruan, Ali Abbasi, Jing Zhang, Tao Wang, Ehsan Latif, Wei Liu, Wei Zhang, Soheil Kolouri, Xiaoming Zhai, Dajiang Zhu, Wenxuan Zhong, Tianming Liu, Ping Ma

Abstract: The exponential growth of Large Language Models (LLMs) continues to highlight the need for efficient strategies to meet ever-expanding computational and data demands. This survey provides a comprehensive analysis of two complementary paradigms: Knowledge Distillation (KD) and Dataset Distillation (DD), both aimed at compressing LLMs while preserving their advanced reasoning capabilities and linguistic diversity. We first examine key methodologies in KD, such as task-specific alignment, rationale-based training, and multi-teacher frameworks, alongside DD techniques that synthesize compact, high-impact datasets through optimization-based gradient matching, latent space regularization, and generative synthesis. Building on these foundations, we explore how integrating KD and DD can produce more effective and scalable compression strategies. Together, these approaches address persistent challenges in model scalability, architectural heterogeneity, and the preservation of emergent LLM abilities. We further highlight applications across domains such as healthcare and education, where distillation enables efficient deployment without sacrificing performance. Despite substantial progress, open challenges remain in preserving emergent reasoning and linguistic diversity, enabling efficient adaptation to continually evolving teacher models and datasets, and establishing comprehensive evaluation protocols. By synthesizing methodological innovations, theoretical foundations, and practical insights, our survey charts a path toward sustainable, resource-efficient LLMs through the tighter integration of KD and DD principles.

Submitted 20 April, 2025; originally announced April 2025.

arXiv:2502.00924 [pdf, other]

Generalized Simple Graphical Rules for Assessing Selection Bias

Authors: Yichi Zhang, Haidong Lu

Abstract: Selection bias is a major obstacle toward valid causal inference in epidemiology. Over the past decade, several simple graphical rules based on causal diagrams have been proposed as the sufficient identification conditions for addressing selection bias and recovering causal effects. However, these simple graphical rules are usually coupled with specific identification strategies and estimators. In this article, we show two important cases of selection bias that cannot be addressed by these simple rules and their estimators: one case where selection is a descendant of a collider of the treatment and the outcome, and the other case where selection is affected by the mediator. To address selection bias in these two cases, we construct identification formulas by the g-computation and the inverse probability weighting (IPW) methods based on single-world intervention graphs (SWIGs). They are generalized to recover the average treatment effect by adjusting for post-treatment upstream causes of selection. We propose two IPW estimators and their variance estimators to recover the average treatment effect in the presence of selection bias in these two cases. We conduct simulation studies to verify the performance of the estimators when the traditional crude selected-sample analysis returns erroneous conclusions that contradict the truth.

Submitted 16 February, 2025; v1 submitted 2 February, 2025; originally announced February 2025.

arXiv:2410.10912 [pdf, other]

AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models

Authors: Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W. Mahoney, Yaoqing Yang

Abstract: Recent work on pruning large language models (LLMs) has shown that one can eliminate a large number of parameters without compromising performance, making pruning a promising strategy to reduce LLM model size. Existing LLM pruning strategies typically assign uniform pruning ratios across layers, limiting overall pruning ability; and recent work on layerwise pruning of LLMs is often based on heuristics that can easily lead to suboptimal performance. In this paper, we leverage Heavy-Tailed Self-Regularization (HT-SR) Theory, in particular the shape of empirical spectral densities (ESDs) of weight matrices, to design improved layerwise pruning ratios for LLMs. Our analysis reveals a wide variability in how well-trained, and thus relatedly how prunable, different layers of an LLM are. Based on this, we propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner. AlphaPruning can be used in conjunction with multiple existing LLM pruning methods. Our empirical results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs. We have open-sourced our code at https://github.com/haiquanlu/AlphaPruning.

Submitted 13 October, 2024; originally announced October 2024.

Comments: NeurIPS 2024, first two authors contributed equally

arXiv:2408.04081 [pdf]

A Framework for Assessing Cumulative Exposure to Extreme Temperatures During Transit Trip

Authors: Huiying Fan, Hongyu Lu, Geyu Lyu, Angshuman Guin, Randall Guensler

Abstract: The combined influence of urban heat islands, climate change, and extreme temperature events is increasingly impacting transit travelers, especially vulnerable populations such as older adults, people with disabilities, and those with chronic diseases. Previous studies have generally attempted to address this issue at either the micro- or macro-level, but each approach presents different limitations in modeling the impacts on transit trips. Other research proposes a meso-level approach to address some of these gaps, but the use of additive exposure calculation and spatial shortest path routing constrains meso-modeling accuracy. This study introduces HeatPath Analyzer, a framework to assess the exposure of transit riders to extreme temperatures, using TransitSim 4.0 to generate second-by-second spatio-temporal trip trajectories, the traveler activity profiles, and thermal comfort levels along the entire journey. The approach combines the heat stress standards proposed by the NWS and CDC to estimate cumulative exposure for transit riders, with specific parameters tailored to the elderly and people with disabilities. The framework assesses the influence of extreme heat and winter chill. A case study in Atlanta, GA, reveals that 10.2% of trips on an average summer weekday in 2019 were at risk of extreme heat. The results uncover exposure disparities across different transit trip mode segments, and across mitigation-based and adaptation-based strategies. While the mitigation-based strategy highlights high-exposure segments such as long ingress and egress, adaptation should be prioritized toward the middle or second half of the trip when a traveler is waiting for transit or transferring between routes. A comparison between the traditional additive approach and the dynamic approach presented also shows significant disparities, which, if overlooked, can mislead policy decisions.

Submitted 7 August, 2024; originally announced August 2024.

Comments: 44 pages, 1 table, 8 figures

arXiv:2407.12996 [pdf, other]

Sharpness-diversity tradeoff: improving flat ensembles with SharpBalance

Authors: Haiquan Lu, Xiaotian Liu, Yefan Zhou, Qunli Li, Kurt Keutzer, Michael W. Mahoney, Yujun Yan, Huanrui Yang, Yaoqing Yang

Abstract: Recent studies on deep ensembles have identified the sharpness of the local minima of individual learners and the diversity of the ensemble members as key factors in improving test-time performance. Building on this, our study investigates the interplay between sharpness and diversity within deep ensembles, illustrating their crucial role in robust generalization to both in-distribution (ID) and out-of-distribution (OOD) data. We discover a trade-off between sharpness and diversity: minimizing the sharpness in the loss landscape tends to diminish the diversity of individual members within the ensemble, adversely affecting the ensemble's improvement. The trade-off is justified through our theoretical analysis and verified empirically through extensive experiments. To address the issue of reduced diversity, we introduce SharpBalance, a novel training approach that balances sharpness and diversity within ensembles. Theoretically, we show that our training strategy achieves a better sharpness-diversity trade-off. Empirically, we conducted comprehensive evaluations in various data sets (CIFAR-10, CIFAR-100, TinyImageNet) and showed that SharpBalance not only effectively improves the sharpness-diversity trade-off, but also significantly improves ensemble performance in ID and OOD scenarios.

Submitted 17 July, 2024; originally announced July 2024.

arXiv:2401.08159 [pdf, other]

Reluctant Interaction Modeling in Generalized Linear Models

Authors: Hai Lu, Guo Yu

Abstract: While including pairwise interactions in a regression model can better approximate the response surface, fitting such an interaction model is a well-known difficult problem. In particular, analyzing contemporary high-dimensional datasets often leads to extremely large-scale interaction modeling problems, where the challenge is to identify important interactions among millions or even billions of candidate interactions. While several methods have recently been proposed to tackle this challenge, they are mostly designed by (1) assuming the hierarchy assumption among the important interactions and (or) (2) focusing on the case in linear models with interactions and (sub)Gaussian errors. In practice, however, neither of these two building blocks has to hold. In this paper, we propose an interaction modeling framework in generalized linear models (GLMs) which is free of any assumptions on hierarchy. We develop a non-trivial extension of the reluctance interaction selection principle to the GLMs setting, where a main effect is preferred over an interaction if all else is equal. Our proposed method is easy to implement, and is highly scalable to large-scale datasets. Theoretically, we demonstrate that it possesses screening consistency under the high-dimensional setting. Numerical studies on simulated datasets and a real dataset show that the proposed method does not sacrifice statistical performance in the presence of significant computational gain.

Submitted 16 January, 2024; originally announced January 2024.

Comments: 41 pages

arXiv:2310.07999 [pdf, other]

LEMON: Lossless model expansion

Authors: Yite Wang, Jiahao Su, Hanlin Lu, Cong Xie, Tianyi Liu, Jianbo Yuan, Haibin Lin, Ruoyu Sun, Hongxia Yang

Abstract: Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance and has further led to the emergence of sophisticated reasoning capabilities in foundation models. Such scaling generally requires training large models from scratch with random initialization, failing to leverage the knowledge acquired by their smaller counterparts, which are already resource-intensive to obtain. To tackle this inefficiency, we present $\textbf{L}$ossl$\textbf{E}$ss $\textbf{MO}$del Expansio$\textbf{N}$ (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts. This is followed by model training with an optimized learning rate scheduler tailored explicitly for the scaled models, substantially reducing the training time compared to training from scratch. Notably, LEMON is versatile, ensuring compatibility with various network structures, including models like Vision Transformers and BERT. Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.

Submitted 11 October, 2023; originally announced October 2023.

Comments: Preprint

arXiv:2304.04692 [pdf, other]

Scalable Randomized Kernel Methods for Multiview Data Integration and Prediction

Authors: Sandra E. Safo, Han Lu

Abstract: We develop scalable randomized kernel methods for jointly associating data from multiple sources and simultaneously predicting an outcome or classifying a unit into one of two or more classes. The proposed methods model nonlinear relationships in multiview data together with predicting a clinical outcome and are capable of identifying variables or groups of variables that best contribute to the relationships among the views. We use the idea that random Fourier bases can approximate shift-invariant kernel functions to construct nonlinear mappings of each view and we use these mappings and the outcome variable to learn view-independent low-dimensional representations. Through simulation studies, we show that the proposed methods outperform several other linear and nonlinear methods for multiview data integration. When the proposed methods were applied to gene expression, metabolomics, proteomics, and lipidomics data pertaining to COVID-19, we identified several molecular signatures for COVID-19 status and severity. Results from our real data application and simulations with small sample sizes suggest that the proposed methods may be useful for small sample size problems. Availability: Our algorithms are implemented in PyTorch and interfaced in R and would be made available at: https://github.com/lasandrall/RandMVLearn.

Submitted 10 April, 2023; originally announced April 2023.

Comments: 24 pages, 5 figures, 4 tables

arXiv:2303.05399 [pdf, ps, other]

Practical Statistical Considerations for the Clinical Validation of AI/ML-enabled Medical Diagnostic Devices

Authors: Feiming Chen, Hong Laura Lu, Arianna Simonetti

Abstract: Artificial Intelligence (AI) and Machine-Learning (ML) models have been increasingly used in medical products, such as medical device software. General considerations on the statistical aspects for the evaluation of AI/ML-enabled medical diagnostic devices are discussed in this paper. We also provide relevant academic references and note good practices in addressing various statistical challenges in the clinical validation of AI/ML-enabled medical devices in the context of their intended use.

Submitted 2 March, 2023; originally announced March 2023.

Comments: 20 pages, 1 table

arXiv:2302.07930 [pdf, other]

doi 10.1186/s12859-024-05679-9

Interpretable Deep Learning Methods for Multiview Learning

Authors: Hengkang Wang, Han Lu, Ju Sun, Sandra E Safo

Abstract: Technological advances have enabled the generation of unique and complementary types of data or views (e.g. genomics, proteomics, metabolomics) and opened up a new era in multiview learning research with the potential to lead to new biomedical discoveries. We propose iDeepViewLearn (Interpretable Deep Learning Method for Multiview Learning) for learning nonlinear relationships in data from multiple views while achieving feature selection. iDeepViewLearn combines deep learning flexibility with the statistical benefits of data and knowledge-driven feature selection, giving interpretable results. Deep neural networks are used to learn view-independent low-dimensional embedding through an optimization problem that minimizes the difference between observed and reconstructed data, while imposing a regularization penalty on the reconstructed data. The normalized Laplacian of a graph is used to model bilateral relationships between variables in each view, thereby encouraging selection of related variables. iDeepViewLearn is tested on simulated data and two real-world datasets, including breast cancer-related gene expression and methylation data. iDeepViewLearn had competitive classification results and identified genes and CpG sites that differentiated between individuals who died from breast cancer and those who did not. The results of our real data application and simulations with small to moderate sample sizes suggest that iDeepViewLearn may be a useful method for small-sample-size problems compared to other deep learning methods for multiview learning.

Submitted 15 February, 2024; v1 submitted 15 February, 2023; originally announced February 2023.

Comments: Published in BMC Bioinformatics (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05679-9)

Journal ref: BMC Bioinformatics 25, 69 (2024)

arXiv:2211.16509 [pdf, other]

doi 10.1142/S2811032322500047

Multimodal Learning for Multi-Omics: A Survey

Authors: Sina Tabakhi, Mohammod Naimul Islam Suvon, Pegah Ahadian, Haiping Lu

Abstract: With advanced imaging, sequencing, and profiling technologies, multiple omics data become increasingly available and hold promises for many healthcare applications such as cancer diagnosis and treatment. Multimodal learning for integrative multi-omics analysis can help researchers and practitioners gain deep insights into human diseases and improve clinical decisions. However, several challenges are hindering the development in this area, including the availability of easily accessible open-source tools. This survey aims to provide an up-to-date overview of the data challenges, fusion approaches, datasets, and software tools from several new perspectives. We identify and investigate various omics data challenges that can help us understand the field better. We categorize fusion approaches comprehensively to cover existing methods in this area. We collect existing open-source tools to facilitate their broader utilization and development. We explore a broad range of omics data modalities and a list of accessible datasets. Finally, we summarize future directions that can potentially address existing gaps and answer the pressing need to advance multimodal learning for multi-omics data analysis.

Submitted 19 December, 2022; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: 52 pages, 3 figures; Revised matrix factorization fusion section

arXiv:2112.02180 [pdf, other]

Generalized Transitional Markov Chain Monte Carlo Sampling Technique for Bayesian Inversion

Authors: Han Lu, Mohammad Khalil, Thomas Catanach, Jiefu Chen, Xuqing Wu, Xin Fu, Cosmin Safta, Yueqin Huang

Abstract: In the context of Bayesian inversion for scientific and engineering modeling, Markov chain Monte Carlo sampling strategies are the benchmark due to their flexibility and robustness in dealing with arbitrary posterior probability density functions (PDFs). However, these algorithms have been shown to be inefficient when sampling from posterior distributions that are high-dimensional or exhibit multi-modality and/or strong parameter correlations. In such contexts, the sequential Monte Carlo technique of transitional Markov chain Monte Carlo (TMCMC) provides a more efficient alternative. Despite the recent applicability for Bayesian updating and model selection across a variety of disciplines, TMCMC may require a prohibitive number of tempering stages when the prior PDF is significantly different from the target posterior. Furthermore, the need to start with an initial set of samples from the prior distribution may present a challenge when dealing with implicit priors, e.g. based on feasible regions. Finally, TMCMC cannot be used for inverse problems with improper prior PDFs that represent lack of prior knowledge on all or a subset of parameters. In this investigation, a generalization of TMCMC that alleviates such challenges and limitations is proposed, resulting in a tempering sampling strategy of enhanced robustness and computational efficiency. Convergence analysis of the proposed sequential Monte Carlo algorithm is presented, proving that the distance between the intermediate distributions and the target posterior distribution monotonically decreases as the algorithm proceeds. The enhanced efficiency associated with the proposed generalization is highlighted through a series of test inverse problems and an engineering application in the oil and gas industry.

Submitted 3 December, 2021; originally announced December 2021.

arXiv:2106.09756 [pdf, other]

PyKale: Knowledge-Aware Machine Learning from Multiple Sources in Python

Authors: Haiping Lu, Xianyuan Liu, Robert Turner, Peizhen Bai, Raivo E Koot, Shuo Zhou, Mustafa Chasmai, Lawrence Schobs

Abstract: Machine learning is a general-purpose technology holding promises for many interdisciplinary research problems. However, significant barriers exist in crossing disciplinary boundaries when most machine learning tools are developed in different areas separately. We present PyKale, a Python library for knowledge-aware machine learning on graphs, images, texts, and videos to enable and accelerate interdisciplinary research. We formulate new green machine learning guidelines based on standard software engineering practices and propose a novel pipeline-based application programming interface (API). PyKale focuses on leveraging knowledge from multiple sources for accurate and interpretable prediction, thus supporting multimodal learning and transfer learning (particularly domain adaptation) with the latest deep learning and dimensionality reduction models. We build PyKale on PyTorch and leverage the rich PyTorch ecosystem. Our pipeline-based API design enforces standardization and minimalism, embracing green machine learning concepts via reducing repetitions and redundancy, reusing existing resources, and recycling learning models across areas. We demonstrate its interdisciplinary nature via examples in bioinformatics, knowledge graph, image/video recognition, and medical imaging.

Submitted 17 June, 2021; originally announced June 2021.

Comments: This library is available at https://github.com/pykale/pykale

arXiv:2103.11539 [pdf, other]

Interpretable, predictive spatio-temporal models via enhanced Pairwise Directions Estimation

Authors: Heng-Hui Lue, ShengLi Tzeng

Abstract: This article concerns the predictive modeling for spatio-temporal data as well as model interpretation using data information in space and time. We develop a novel approach based on supervised dimension reduction for such data in order to capture nonlinear mean structures without requiring a prespecified parametric model. In addition to prediction as a common interest, this approach emphasizes the exploration of geometric information from the data. The method of Pairwise Directions Estimation (PDE; Lue, 2019) is implemented in our approach as a data-driven function searching for spatial patterns and temporal trends. The benefit of using geometric information from the method of PDE is highlighted, which aids effectively in exploring data structures. We further enhance PDE, referring to it as PDE+, by incorporating kriging to estimate the random effects not explained in the mean functions. Our proposal can not only increase prediction accuracy, but also improve the interpretation for modeling. Two simulation examples are conducted and comparisons are made with four existing methods. The results demonstrate that the proposed PDE+ method is very useful for exploring and interpreting the patterns and trends for spatio-temporal data. Illustrative applications to two real datasets are also presented.

Submitted 6 November, 2021; v1 submitted 21 March, 2021; originally announced March 2021.

Comments: 18 pages, 4 figures

arXiv:2102.03607 [pdf, other]

Bootstrapping Fitted Q-Evaluation for Off-Policy Inference

Authors: Botao Hao, Xiang Ji, Yaqi Duan, Hao Lu, Csaba Szepesvári, Mengdi Wang

Abstract: Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical property is less understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on the fitted Q-evaluation (FQE) that is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computation limit of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.

Submitted 22 May, 2022; v1 submitted 6 February, 2021; originally announced February 2021.

Comments: Accepted at ICML 2021

arXiv:2009.08685 [pdf, other]

GrateTile: Efficient Sparse Tensor Tiling for CNN Processing

Authors: Yu-Sheng Lin, Hung Chang Lu, Yang-Bin Tsao, Yi-Min Chih, Wei-Chao Chen, Shao-Yi Chien

Abstract: We propose GrateTile, an efficient, hardware-friendly data storage scheme for sparse CNN feature maps (activations). It divides data into uneven-sized sub-tensors and, with small indexing overhead, stores them in a compressed yet randomly accessible format. This design enables modern CNN accelerators to fetch and decompress sub-tensors on-the-fly in a tiled processing manner. GrateTile is suitable for architectures that favor aligned, coalesced data access, and only requires minimal changes to the overall architectural design. We simulate GrateTile with state-of-the-art CNNs and show an average of 55% DRAM bandwidth reduction while using only 0.6% of feature map size for indexing storage.

Submitted 18 September, 2020; originally announced September 2020.

Comments: To be published at IEEE Workshop on Signal Processing System (SiPS 2020)

arXiv:2007.12375 [pdf, other]

Impact of Medical Data Imprecision on Learning Results

Authors: Mei Wang, Jianwen Su, Haiqin Lu

Abstract: Test data measured by medical instruments often carry imprecise ranges that include the true values. The latter are not obtainable in virtually all cases. Most learning algorithms, however, carry out arithmetical calculations that are subject to uncertain influence in both the learning process to obtain models and applications of the learned models in, e.g. prediction. In this paper, we initiate a study on the impact of imprecision on prediction results in a healthcare application where a pre-trained model is used to predict future state of hyperthyroidism for patients. We formulate a model for data imprecisions. Using parameters to control the degree of imprecision, imprecise samples for comparison experiments can be generated using this model. Further, a group of measures are defined to evaluate the different impacts quantitatively. More specifically, the statistics to measure the inconsistent prediction for individual patients are defined. We perform experimental evaluations to compare prediction results based on the data from the original dataset and the corresponding ones generated from the proposed precision model using the long-short-term memories (LSTM) network. The results against a real world hyperthyroidism dataset provide insights into how small imprecisions can cause large ranges of predicted results, which could cause mis-labeling and inappropriate actions (treatments or no treatments) for individual patients.

Submitted 24 July, 2020; originally announced July 2020.

Comments: 2020 KDD Workshop on Applied Data Science for Healthcare

arXiv:2007.02977 [pdf, other]

Sharing Models or Coresets: A Study based on Membership Inference Attack

Authors: Hanlin Lu, Changchang Liu, Ting He, Shiqiang Wang, Kevin S. Chan

Abstract: Distributed machine learning generally aims at training a global model based on distributed data without collecting all the data to a centralized location, where two different approaches have been proposed: collecting and aggregating local models (federated learning) and collecting and training over representative data summaries (coreset). While each approach preserves data privacy to some extent thanks to not sharing the raw data, the exact extent of protection is unclear under sophisticated attacks that try to infer the raw data from the shared information. We present the first comparison between the two approaches in terms of target model accuracy, communication cost, and data privacy, where the last is measured by the accuracy of a state-of-the-art attack strategy called the membership inference attack. Our experiments quantify the accuracy-privacy-cost tradeoff of each approach, and reveal a nontrivial comparison that can be used to guide the design of model training processes.

Submitted 6 July, 2020; originally announced July 2020.

arXiv:2006.09815 [pdf, other]

A Deep Neural Network for Audio Classification with a Classifier Attention Mechanism

Authors: Haoye Lu, Haolong Zhang, Amit Nayak

Abstract: Audio classification is considered as a challenging problem in pattern recognition. Recently, many algorithms have been proposed using deep neural networks. In this paper, we introduce a new attention-based neural network architecture called Classifier-Attention-Based Convolutional Neural Network (CAB-CNN). The algorithm uses a newly designed architecture consisting of a list of simple classifiers and an attention mechanism as a classifier selector. This design significantly reduces the number of parameters required by the classifiers and thus their complexities. In this way, it becomes easier to train the classifiers and achieve a high and steady performance. Our claims are corroborated by the experimental results. Compared to the state-of-the-art algorithms, our algorithm achieves more than 10% improvements on all selected test scores.

Submitted 14 June, 2020; originally announced June 2020.

arXiv:2006.08667 [pdf, other]

The Landscape of the Proximal Point Method for Nonconvex-Nonconcave Minimax Optimization

Authors: Benjamin Grimmer, Haihao Lu, Pratik Worah, Vahab Mirrokni

Abstract: Minimax optimization has become a central tool in machine learning with applications in robust optimization, reinforcement learning, GANs, etc. These applications are often nonconvex-nonconcave, but the existing theory is unable to identify and deal with the fundamental difficulties this poses. In this paper, we study the classic proximal point method (PPM) applied to nonconvex-nonconcave minimax problems. We find that a classic generalization of the Moreau envelope by Attouch and Wets provides key insights. Critically, we show this envelope not only smooths the objective but can convexify and concavify it based on the level of interaction present between the minimizing and maximizing variables. From this, we identify three distinct regions of nonconvex-nonconcave problems. When interaction is sufficiently strong, we derive global linear convergence guarantees. Conversely when the interaction is fairly weak, we derive local linear convergence guarantees with a proper initialization. Between these two settings, we show that PPM may diverge or converge to a limit cycle.

Submitted 1 April, 2021; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: Notably updated version that connects our theory with that of Attouch and Wets from the 80s and notably expands on our first posting to apply to generic minimax problems (rather than requiring bilinear interaction)

MSC Class: 65K05; 65K10; 90C26; 90C15; 90C30

arXiv:2006.00038 [pdf, other]

Quasi-orthonormal Encoding for Machine Learning Applications

Authors: Haw-minn Lu

Abstract: Most machine learning models, especially artificial neural networks, require numerical, not categorical data. We briefly describe the advantages and disadvantages of common encoding schemes. For example, one-hot encoding is commonly used for attributes with a few unrelated categories and word embeddings for attributes with many related categories (e.g., words). Neither is suitable for encoding attributes with many unrelated categories, such as diagnosis codes in healthcare applications. Application of one-hot encoding for diagnosis codes, for example, can result in extremely high dimensionality with low sample size problems or artificially induce machine learning artifacts, not to mention the explosion of computing resources needed. Quasi-orthonormal encoding (QOE) fills the gap. We briefly show how QOE compares to one-hot encoding. We provide example code of how to implement QOE using popular ML libraries such as Tensorflow and PyTorch and a demonstration of QOE to MNIST handwriting samples.

Submitted 29 May, 2020; originally announced June 2020.

Comments: Accepted and submitted to 19th Python in Science Conference. (SciPy 2020)

arXiv:2004.03104 [pdf, other]

Generalized Label Enhancement with Sample Correlations

Authors: Qinghai Zheng, Jihua Zhu, Haoyu Tang, Xinyuan Liu, Zhongyu Li, Huimin Lu

Abstract: Recently, label distribution learning (LDL) has drawn much attention in machine learning, where LDL model is learned from labeled instances. Different from single-label and multi-label annotations, label distributions describe the instance by multiple labels with different intensities and accommodate to more general scenes. Since most existing machine learning datasets merely provide logical labels, label distributions are unavailable in many real-world applications. To handle this problem, we propose two novel label enhancement methods, i.e., Label Enhancement with Sample Correlations (LESC) and generalized Label Enhancement with Sample Correlations (gLESC). More specifically, LESC employs a low-rank representation of samples in the feature space, and gLESC leverages a tensor multi-rank minimization to further investigate the sample correlations in both the feature space and label space. Benefitting from the sample correlations, the proposed methods can boost the performance of label enhancement. Extensive experiments on 14 benchmark datasets demonstrate the effectiveness and superiority of our methods.

Submitted 11 April, 2021; v1 submitted 6 April, 2020; originally announced April 2020.

arXiv:2003.11723 [pdf, other]

Learning transferable and discriminative features for unsupervised domain adaptation

Authors: Yuntao Du, Ruiting Zhang, Xiaowen Zhang, Yirong Yao, Hengyang Lu, Chongjun Wang

Abstract: Although achieving remarkable progress, it is very difficult to induce a supervised classifier without any labeled data. Unsupervised domain adaptation is able to overcome this challenge by transferring knowledge from a labeled source domain to an unlabeled target domain. Transferability and discriminability are two key criteria for characterizing the superiority of feature representations to enable successful domain adaptation. In this paper, a novel method called \textit{learning TransFerable and Discriminative Features for unsupervised domain adaptation} (TFDF) is proposed to optimize these two objectives simultaneously. On the one hand, distribution alignment is performed to reduce domain discrepancy and learn more transferable representations. Instead of adopting \textit{Maximum Mean Discrepancy} (MMD) which only captures the first-order statistical information to measure distribution discrepancy, we adopt a recently proposed statistic called \textit{Maximum Mean and Covariance Discrepancy} (MMCD), which can not only capture the first-order statistical information but also capture the second-order statistical information in the reproducing kernel Hilbert space (RKHS). On the other hand, we propose to explore both local discriminative information via manifold regularization and global discriminative information via minimizing the proposed \textit{class confusion} objective to learn more discriminative features, respectively. We integrate these two objectives into the \textit{Structural Risk Minimization} (SRM) framework and learn a domain-invariant classifier. Comprehensive experiments are conducted on five real-world datasets and the results verify the effectiveness of the proposed method.

Submitted 25 June, 2021; v1 submitted 25 March, 2020; originally announced March 2020.

Comments: Accepted by IDA journal

arXiv:2002.08338 [pdf, ps, other]

Multiple Imputation with Denoising Autoencoder using Metamorphic Truth and Imputation Feedback

Authors: Haw-minn Lu, Giancarlo Perrone, José Unpingco

Abstract: Although data may be abundant, complete data is less so, due to missing columns or rows. This missingness undermines the performance of downstream data products that either omit incomplete cases or create derived completed data for subsequent processing. Appropriately managing missing data is required in order to fully exploit and correctly use data. We propose a Multiple Imputation model using Denoising Autoencoders to learn the internal representation of data. Furthermore, we use the novel mechanisms of Metamorphic Truth and Imputation Feedback to maintain statistical integrity of attributes and eliminate bias in the learning process. Our approach explores the effects of imputation on various missingness mechanisms and patterns of missing data, outperforming other methods in many standard test cases.

Submitted 24 June, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

Comments: Machine Learning and Data Mining in Pattern Recognition, 16th International Conference on Machine Learning and Data Mining, MLDM 2020, Amsterdam, The Netherlands, July 20-21, 2020, Proceedings, pp. 197-208

arXiv:2001.10516 [pdf, other]

Tri-graph Information Propagation for Polypharmacy Side Effect Prediction

Authors: Hao Xu, Shengqi Sang, Haiping Lu

Abstract: The use of drug combinations often leads to polypharmacy side effects (POSE). A recent method formulates POSE prediction as a link prediction problem on a graph of drugs and proteins, and solves it with Graph Convolutional Networks (GCNs). However, due to the complex relationships in POSE, this method has high computational cost and memory demand. This paper proposes a flexible Tri-graph Information Propagation (TIP) model that operates on three subgraphs to learn representations progressively by propagation from protein-protein graph to drug-drug graph via protein-drug graph. Experiments show that TIP improves accuracy by 7%+, time efficiency by 83$\times$, and space efficiency by 3$\times$.

Submitted 28 January, 2020; originally announced January 2020.

Comments: Presented at NeurIPS 2019 Graph Representation Learning Workshop

arXiv:1911.11185 [pdf, other]

Theory-based Causal Transfer: Integrating Instance-level Induction and Abstract-level Structure Learning

Authors: Mark Edmonds, Xiaojian Ma, Siyuan Qi, Yixin Zhu, Hongjing Lu, Song-Chun Zhu

Abstract: Learning transferable knowledge across similar but different settings is a fundamental component of generalized intelligence. In this paper, we approach the transfer learning challenge from a causal theory perspective. Our agent is endowed with two basic yet general theories for transfer learning: (i) a task shares a common abstract structure that is invariant across domains, and (ii) the behavior of specific features of the environment remain constant across domains. We adopt a Bayesian perspective of causal theory induction and use these theories to transfer knowledge between environments. Given these general theories, the goal is to train an agent by interactively exploring the problem space to (i) discover, form, and transfer useful abstract and structural knowledge, and (ii) induce useful knowledge from the instance-level attributes observed in the environment. A hierarchy of Bayesian structures is used to model abstract-level structural causal knowledge, and an instance-level associative learning scheme learns which specific objects can be used to induce state changes through interaction. This model-learning scheme is then integrated with a model-based planner to achieve a task in the OpenLock environment, a virtual ``escape room'' with a complex hierarchy that requires agents to reason about an abstract, generalized causal structure. We compare performances against a set of predominant model-free reinforcement learning (RL) algorithms. RL agents showed poor ability transferring learned knowledge across different trials, whereas the proposed model revealed similar performance trends as human learners, and more importantly, demonstrated transfer behavior across trials and learning situations.

Submitted 25 November, 2019; originally announced November 2019.

Comments: Accepted to AAAI 2020 as an oral

arXiv:1909.03835 [pdf, other]

Data Sanity Check for Deep Learning Systems via Learnt Assertions

Authors: Haochuan Lu, Huanlin Xu, Nana Liu, Yangfan Zhou, Xin Wang

Abstract: Reliability is a critical consideration to DL-based systems. But the statistical nature of DL makes it quite vulnerable to invalid inputs, i.e., those cases that are not considered in the training phase of a DL model. This paper proposes to perform data sanity check to identify invalid inputs, so as to enhance the reliability of DL-based systems. We design and implement a tool to detect behavior deviation of a DL model when processing an input case. This tool extracts the data flow footprints and conducts an assertion-based validation mechanism. The assertions are built automatically, which are specifically-tailored for DL model data flow analysis. Our experiments conducted with real-world scenarios demonstrate that such an assertion-based data sanity check mechanism is effective in identifying invalid input cases.

Submitted 28 September, 2019; v1 submitted 6 September, 2019; originally announced September 2019.

arXiv:1907.04371 [pdf, other]

Ordered SGD: A New Stochastic Optimization Framework for Empirical Risk Minimization

Authors: Kenji Kawaguchi, Haihao Lu

Abstract: We propose a new stochastic optimization framework for empirical risk minimization problems such as those that arise in machine learning. The traditional approaches, such as (mini-batch) stochastic gradient descent (SGD), utilize an unbiased gradient estimator of the empirical average loss. In contrast, we develop a computationally efficient method to construct a gradient estimator that is purposely biased toward those observations with higher current losses. On the theory side, we show that the proposed method minimizes a new ordered modification of the empirical average loss, and is guaranteed to converge at a sublinear rate to a global optimum for convex loss and to a critical point for weakly convex (non-convex) loss. Furthermore, we prove a new generalization bound for the proposed algorithm. On the empirical side, the numerical experiments show that our proposed method consistently improves the test errors compared with the standard mini-batch SGD in various models including SVM, logistic regression, and deep learning problems.

Submitted 1 February, 2020; v1 submitted 9 July, 2019; originally announced July 2019.

Comments: Accepted in AISTATS 2020. Code available at: https://github.com/k9k2/qSGD

arXiv:1906.01407 [pdf, other]

RL4health: Crowdsourcing Reinforcement Learning for Knee Replacement Pathway Optimization

Authors: Hao Lu, Mengdi Wang

Abstract: Joint replacement is the most common inpatient surgical treatment in the US. We investigate the clinical pathway optimization for knee replacement, which is a sequential decision process from onset to recovery. Based on episodic claims from previous cases, we view the pathway optimization as an intelligence crowdsourcing problem and learn the optimal decision policy from data by imitating the best expert at every intermediate state. We develop a reinforcement learning-based pipeline that uses value iteration, state compression and aggregation learning, kernel representation and cross validation to predict the best treatment policy. It also provides forecast of the clinical pathway under the optimized policy. Empirical validation shows that the optimized policy reduces the overall cost by 7 percent and reduces the excessive cost premium by 33 percent.

Submitted 24 May, 2019; originally announced June 2019.

arXiv:1905.05884 [pdf, other]

Approximate Bayesian computation via the energy statistic

Authors: Hien D. Nguyen, Julyan Arbel, Hongliang Lü, Florence Forbes

Abstract: Approximate Bayesian computation (ABC) has become an essential part of the Bayesian toolbox for addressing problems in which the likelihood is prohibitively expensive or entirely unknown, making it intractable. ABC defines a pseudo-posterior by comparing observed data with simulated data, traditionally based on some summary statistics, the elicitation of which is regarded as a key difficulty. Recently, using data discrepancy measures has been proposed in order to bypass the construction of summary statistics. Here we propose to use the importance-sampling ABC (IS-ABC) algorithm relying on the so-called two-sample energy statistic. We establish a new asymptotic result for the case where both the observed sample size and the simulated data sample size increase to infinity, which highlights to what extent the data discrepancy measure impacts the asymptotic pseudo-posterior. The result holds in the broad setting of IS-ABC methodologies, thus generalizing previous results that have been established only for rejection ABC algorithms. Furthermore, we propose a consistent V-statistic estimator of the energy statistic, under which we show that the large sample result holds, and prove that the rejection ABC algorithm, based on the energy statistic, generates pseudo-posterior distributions that achieve convergence to the correct limits, when implemented with rejection thresholds that converge to zero, in the finite sample setting. Our proposed energy statistic based ABC algorithm is demonstrated on a variety of models, including a Gaussian mixture, a moving-average model of order two, a bivariate beta and a multivariate $g$-and-$k$ distribution. We find that our proposed method compares well with alternative discrepancy measures.

Submitted 30 June, 2020; v1 submitted 14 May, 2019; originally announced May 2019.

Comments: 25 pages, 6 figures, 5 tables

Journal ref: IEEE Access (2020)

arXiv:1904.05961 [pdf, other]

Robust Coreset Construction for Distributed Machine Learning

Authors: Hanlin Lu, Ming-Ju Li, Ting He, Shiqiang Wang, Vijaykrishnan Narayanan, Kevin S Chan

Abstract: Coreset, which is a summary of the original dataset in the form of a small weighted set in the same sample space, provides a promising approach to enable machine learning over distributed data. Although viewed as a proxy of the original dataset, each coreset is only designed to approximate the cost function of a specific machine learning problem, and thus different coresets are often required to solve different machine learning problems, increasing the communication overhead. We resolve this dilemma by developing robust coreset construction algorithms that can support a variety of machine learning problems. Motivated by empirical evidence that suitably-weighted k-clustering centers provide a robust coreset, we harden the observation by establishing theoretical conditions under which the coreset provides a guaranteed approximation for a broad range of machine learning problems, and developing both centralized and distributed algorithms to generate coresets satisfying the conditions. The robustness of the proposed algorithms is verified through extensive experiments on diverse datasets with respect to both supervised and unsupervised learning problems.

Submitted 22 June, 2020; v1 submitted 11 April, 2019; originally announced April 2019.

arXiv:1903.11020 [pdf, other]

doi 10.1609/aaai.v34i04.6179

Domain Independent SVM for Transfer Learning in Brain Decoding

Authors: Shuo Zhou, Wenwen Li, Christopher R. Cox, Haiping Lu

Abstract: Brain imaging data are important in brain sciences yet expensive to obtain, with big volume (i.e., large p) but small sample size (i.e., small n). To tackle this problem, transfer learning is a promising direction that leverages source data to improve performance on related, target data. Most transfer learning methods focus on minimizing data distribution mismatch. However, a big challenge in brain imaging is the large domain discrepancies in cognitive experiment designs and subject-specific structures and functions. A recent transfer learning approach minimizes domain dependence to learn common features across domains, via the Hilbert-Schmidt Independence Criterion (HSIC). Inspired by this method, we propose a new Domain Independent Support Vector Machine (DI-SVM) for transfer learning in brain condition decoding. Specifically, DI-SVM simultaneously minimizes the SVM empirical risk and the dependence on domain information via a simplified HSIC. We use public data to construct 13 transfer learning tasks in brain decoding, including three interesting multi-source transfer tasks. Experiments show DI-SVM's superior performance over eight competing methods on these tasks, particularly an improvement of more than 24% on multi-source transfer tasks.

Submitted 26 March, 2019; originally announced March 2019.

arXiv:1903.08708 [pdf, other]

Accelerating Gradient Boosting Machine

Authors: Haihao Lu, Sai Praneeth Karimireddy, Natalia Ponomareva, Vahab Mirrokni

Abstract: Gradient Boosting Machine (GBM) is an extremely powerful supervised learning algorithm that is widely used in practice. GBM routinely features as a leading algorithm in machine learning competitions such as Kaggle and the KDDCup. In this work, we propose Accelerated Gradient Boosting Machine (AGBM) by incorporating Nesterov's acceleration techniques into the design of GBM. The difficulty in accelerating GBM lies in the fact that weak (inexact) learners are commonly used, and therefore the errors can accumulate in the momentum term. To overcome it, we design a "corrected pseudo residual" and fit best weak learner to this corrected pseudo residual, in order to perform the z-update. Thus, we are able to derive novel computational guarantees for AGBM. This is the first GBM type of algorithm with theoretically-justified accelerated convergence rate. Finally we demonstrate with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.

Submitted 27 August, 2020; v1 submitted 20 March, 2019; originally announced March 2019.

arXiv:1902.07903 [pdf, other]

Learning Deterministic Policy with Target for Power Control in Wireless Networks

Authors: Yujiao Lu, Hancheng Lu, Liangliang Cao, Feng Wu, Daren Zhu

Abstract: Inter-Cell Interference Coordination (ICIC) is a promising way to improve energy efficiency in wireless networks, especially where small base stations are densely deployed. However, traditional optimization based ICIC schemes suffer from severe performance degradation with complex interference patterns. To address this issue, we propose a Deep Reinforcement Learning with Deterministic Policy and Target (DRL-DPT) framework for ICIC in wireless networks. DRL-DPT overcomes the main obstacles in applying reinforcement learning and deep learning in wireless networks, i.e. continuous state space, continuous action space and convergence. Firstly, a Deep Neural Network (DNN) is involved as the actor to obtain deterministic power control actions in continuous space. Then, to guarantee the convergence, an online training process is presented, which makes use of a dedicated reward function as the target rule and a policy gradient descent algorithm to adjust DNN weights. Experimental results show that the proposed DRL-DPT framework consistently outperforms existing schemes in terms of energy efficiency and throughput under different wireless interference scenarios. More specifically, it improves energy efficiency by up to 15% with a faster convergence rate.

Submitted 21 February, 2019; originally announced February 2019.

Comments: 7 pages, 7 figures, GlobeCom2018

arXiv:1812.10140 [pdf, other]

Mixed-Order Spectral Clustering for Networks

Authors: Yan Ge, Haiping Lu, Pan Peng

Abstract: Clustering is fundamental for gaining insights from complex networks, and spectral clustering (SC) is a popular approach. Conventional SC focuses on second-order structures (e.g., edges connecting two nodes) without direct consideration of higher-order structures (e.g., triangles and cliques). This has motivated SC extensions that directly consider higher-order structures. However, both approaches are limited to considering a single order. This paper proposes a new Mixed-Order Spectral Clustering (MOSC) approach to model both second-order and third-order structures simultaneously, with two MOSC methods developed based on Graph Laplacian (GL) and Random Walks (RW). MOSC-GL combines edge and triangle adjacency matrices, with theoretical performance guarantee. MOSC-RW combines first-order and second-order random walks for a probabilistic interpretation. We automatically determine the mixing parameter based on cut criteria or triangle density, and construct new structure-aware error metrics for performance evaluation. Experiments on real-world networks show 1) the superior performance of two MOSC methods over existing SC methods, 2) the effectiveness of the mixing parameter determination strategy, and 3) insights offered by the structure-aware error metrics.

Submitted 25 December, 2018; originally announced December 2018.

Comments: 12 pages

arXiv:1812.00086 [pdf, other]

Graph Node-Feature Convolution for Representation Learning

Authors: Li Zhang, Heda Song, Nikolaos Aletras, Haiping Lu

Abstract: Graph convolutional network (GCN) is an emerging neural network approach. It learns new representation of a node by aggregating feature vectors of all neighbors in the aggregation process without considering whether the neighbors or features are useful or not. Recent methods have improved solutions by sampling a fixed size set of neighbors, or assigning different weights to different neighbors in the aggregation process, but features within a feature vector are still treated equally in the aggregation process. In this paper, we introduce a new convolution operation on regular size feature maps constructed from features of a fixed node bandwidth via sampling to get the first-level node representation, which is then passed to a standard GCN to learn the second-level node representation. Experiments show that our method outperforms competing methods in semi-supervised node classification tasks. Furthermore, our method opens new doors for exploring new GCN architectures, particularly deeper GCN models.

Submitted 31 March, 2022; v1 submitted 30 November, 2018; originally announced December 2018.

arXiv:1811.11017 [pdf, other]

Latent Dirichlet Allocation with Residual Convolutional Neural Network Applied in Evaluating Credibility of Chinese Listed Companies

Authors: Mohan Zhang, Zhichao Luo, Hai Lu

Abstract: This project demonstrates a methodology for estimating corporate credibility with a Natural Language Processing approach. As corporate transparency impacts both the credibility and possible future earnings of a firm, it is an important factor for banks and investors to consider in risk assessments of listed firms. This approach to estimating corporate credibility can bypass human bias and inconsistency in risk assessment; the use of large quantitative data and neural network models provides more accurate estimation in a more efficient manner compared to manual assessment. First, the model employs Latent Dirichlet Allocation and the THU Open Chinese Lexicon from Tsinghua University to classify topics in articles that are potentially related to corporate credibility. Then, with the keywords related to each topic, we train a residual convolutional neural network with data labeled according to surveys of fund managers' and accountants' opinions on corporate credibility. After training, we run the model on preprocessed news reports regarding all of the 3065 listed companies; the model is expected to return a ranking of companies based on their level of transparency.

Submitted 24 November, 2018; originally announced November 2018.

arXiv:1810.10158 [pdf, other]

Randomized Gradient Boosting Machine

Authors: Haihao Lu, Rahul Mazumder

Abstract: Gradient Boosting Machine (GBM) introduced by Friedman is a powerful supervised learning algorithm that is very widely used in practice---it routinely features as a leading algorithm in machine learning competitions such as Kaggle and the KDDCup. In spite of the usefulness of GBM in practice, our current theoretical understanding of this method is rather limited. In this work, we propose Randomized Gradient Boosting Machine (RGBM) which leads to substantial computational gains compared to GBM, by using a randomization scheme to reduce search in the space of weak-learners. We derive novel computational guarantees for RGBM. We also provide a principled guideline towards better step-size selection in RGBM that does not require a line search. Our proposed framework is inspired by a special variant of coordinate descent that combines the benefits of randomized coordinate descent and greedy coordinate descent; and may be of independent interest as an optimization algorithm. As a special case, our results for RGBM lead to superior computational guarantees for GBM. Our computational guarantees depend upon a curious geometric quantity that we call Minimal Cosine Angle, which relates to the density of weak-learners in the prediction space. On a series of numerical experiments on real datasets, we demonstrate the effectiveness of RGBM over GBM in terms of obtaining a model with good training and/or testing data fidelity with a fraction of the computational cost.

Submitted 15 September, 2020; v1 submitted 23 October, 2018; originally announced October 2018.

arXiv:1810.09177 [pdf, other]

Compositional Coding Capsule Network with K-Means Routing for Text Classification

Authors: Hao Ren, Hong Lu

Abstract: Text classification is a challenging problem which aims to identify the category of texts. In the process of training, word embeddings occupy a large part of the parameters. With limited computing resources, this indirectly limits the capacity of subsequent network designs. In order to reduce the number of parameters, the compositional coding mechanism has been proposed recently. Based on this, this paper further explores compositional coding and proposes a compositional weighted coding method. We also apply a capsule network to model the relationship between word embeddings, and propose a new routing algorithm, based on k-means clustering theory, to fully mine the relationship between word embeddings. Combining our compositional weighted coding method and the routing algorithm, we design a neural network for text classification. Experiments conducted on eight challenging text classification datasets show that the proposed method achieves competitive accuracy compared to the state-of-the-art approach with significantly fewer parameters.

Submitted 2 June, 2022; v1 submitted 22 October, 2018; originally announced October 2018.

Comments: the paper is accepted by Pattern Recognition Letters, please refer to https://www.sciencedirect.com/science/article/pii/S016786552200188X for an updated version

arXiv:1810.02716 [pdf, other]

Approximate Leave-One-Out for High-Dimensional Non-Differentiable Learning Problems

Authors: Shuaiwen Wang, Wenda Zhou, Arian Maleki, Haihao Lu, Vahab Mirrokni

Abstract: Consider the following class of learning schemes: $$\hat{\boldsymbol{\beta}} := \underset{\boldsymbol{\beta} \in \mathcal{C}}{\arg\min} \;\sum_{j=1}^n \ell(\boldsymbol{x}_j^\top\boldsymbol{\beta}; y_j) + \lambda R(\boldsymbol{\beta}), \qquad\qquad (1) $$ where $\boldsymbol{x}_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$ denote the $i^{\rm th}$ feature and response variable respectively. Let $\ell$ and $R$ be the convex loss function and regularizer, $\boldsymbol{\beta}$ denote the unknown weights, and $\lambda$ be a regularization parameter. $\mathcal{C} \subset \mathbb{R}^{p}$ is a closed convex set. Finding the optimal choice of $\lambda$ is a challenging problem in high-dimensional regimes where both $n$ and $p$ are large. We propose three frameworks to obtain a computationally efficient approximation of the leave-one-out cross validation (LOOCV) risk for nonsmooth losses and regularizers. Our three frameworks are based on the primal, dual, and proximal formulations of (1). Each framework shows its strength in certain types of problems. We prove the equivalence of the three approaches under smoothness conditions. This equivalence enables us to justify the accuracy of the three methods under such conditions. We use our approaches to obtain a risk estimate for several standard problems, including generalized LASSO, nuclear norm regularization, and support vector machines. We empirically demonstrate the effectiveness of our results for non-differentiable cases.

Submitted 4 October, 2018; originally announced October 2018.

Comments: 63 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:1807.02694

arXiv:1807.02694 [pdf, other]

Approximate Leave-One-Out for Fast Parameter Tuning in High Dimensions

Authors: Shuaiwen Wang, Wenda Zhou, Haihao Lu, Arian Maleki, Vahab Mirrokni

Abstract: Consider the following class of learning schemes: $$\hat{\boldsymbol{\beta}} := \arg\min_{\boldsymbol{\beta}}\;\sum_{j=1}^n \ell(\boldsymbol{x}_j^\top\boldsymbol{\beta}; y_j) + \lambda R(\boldsymbol{\beta}),\qquad\qquad (1) $$ where $\boldsymbol{x}_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$ denote the $i^{\text{th}}$ feature and response variable respectively. Let $\ell$ and $R$ be the loss function and regularizer, $\boldsymbol{\beta}$ denote the unknown weights, and $\lambda$ be a regularization parameter. Finding the optimal choice of $\lambda$ is a challenging problem in high-dimensional regimes where both $n$ and $p$ are large. We propose two frameworks to obtain a computationally efficient approximation ALO of the leave-one-out cross validation (LOOCV) risk for nonsmooth losses and regularizers. Our two frameworks are based on the primal and dual formulations of (1). We prove the equivalence of the two approaches under smoothness conditions. This equivalence enables us to justify the accuracy of both methods under such conditions. We use our approaches to obtain a risk estimate for several standard problems, including generalized LASSO, nuclear norm regularization, and support vector machines. We empirically demonstrate the effectiveness of our results for non-differentiable cases.

Submitted 7 July, 2018; originally announced July 2018.

Comments: The paper is published at ICML 2018

arXiv:1712.00573 [pdf, other]

Supervised Hashing based on Energy Minimization

Authors: Zihao Hu, Xiyi Luo, Hongtao Lu, Yong Yu

Abstract: Recently, supervised hashing methods have attracted much attention since they can optimize retrieval speed and storage cost while preserving semantic information. Because hashing codes learning is NP-hard, many methods resort to some form of relaxation technique. But the performance of these methods can easily deteriorate due to the relaxation. Luckily, many supervised hashing formulations can be viewed as energy functions, hence solving hashing codes is equivalent to learning marginals in the corresponding conditional random field (CRF). By minimizing the KL divergence between a fully factorized distribution and the Gibbs distribution of this CRF, a set of consistency equations can be obtained, but updating them in parallel may not yield a local optimum since the variational lower bound is not guaranteed to increase. In this paper, we use a linear approximation of the sigmoid function to convert these consistency equations to linear systems, which have a closed-form solution. By applying this novel technique to two classical hashing formulations KSH and SPLH, we obtain two new methods called EM (energy minimizing based)-KSH and EM-SPLH. Experimental results on three datasets show the superiority of our methods.

Submitted 2 December, 2017; originally announced December 2017.

arXiv:1702.08580 [pdf, ps, other]

Depth Creates No Bad Local Minima

Authors: Haihao Lu, Kenji Kawaguchi

Abstract: In deep learning, \textit{depth}, as well as \textit{nonlinearity}, create non-convex loss surfaces. Then, does depth alone create bad local minima? In this paper, we prove that without nonlinearity, depth alone does not create bad local minima, although it induces non-convex loss surface. Using this insight, we greatly simplify a recently proposed proof to show that all of the local minima of feedforward deep linear neural networks are global minima. Our theoretical results generalize previous results with fewer assumptions, and this analysis provides a method to show similar results beyond square loss in deep linear models.

Submitted 23 May, 2017; v1 submitted 27 February, 2017; originally announced February 2017.

arXiv:1604.02100 [pdf, other]

doi 10.1109/TSP.2017.2695566

Hankel Matrix Nuclear Norm Regularized Tensor Completion for $N$-dimensional Exponential Signals

Authors: Jiaxi Ying, Hengfa Lu, Qingtao Wei, Jian-Feng Cai, Di Guo, Jihui Wu, Zhong Chen, Xiaobo Qu

Abstract: Signals are generally modeled as a superposition of exponential functions in spectroscopy of chemistry, biology and medical imaging. For fast data acquisition or other inevitable reasons, however, only a small amount of samples may be acquired and thus how to recover the full signal becomes an active research topic. But existing approaches can not efficiently recover $N$-dimensional exponential signals with $N\geq 3$. In this paper, we study the problem of recovering $N$-dimensional (particularly $N\geq 3$) exponential signals from partial observations, and formulate this problem as a low-rank tensor completion problem with exponential factor vectors. The full signal is reconstructed by simultaneously exploiting the CANDECOMP/PARAFAC structure and the exponential structure of the associated factor vectors. The latter is promoted by minimizing an objective function involving the nuclear norm of Hankel matrices. Experimental results on simulated and real magnetic resonance spectroscopy data show that the proposed approach can successfully recover full signals from very limited samples and is robust to the estimated tensor rank.

Submitted 31 March, 2017; v1 submitted 6 April, 2016; originally announced April 2016.

Comments: 15 pages, 12 figures

arXiv:1504.08142 [pdf, other]

Semi-Orthogonal Multilinear PCA with Relaxed Start

Authors: Qiquan Shi, Haiping Lu

Abstract: Principal component analysis (PCA) is an unsupervised method for learning low-dimensional features with orthogonal projections. Multilinear PCA methods extend PCA to deal with multidimensional data (tensors) directly via tensor-to-tensor projection or tensor-to-vector projection (TVP). However, under the TVP setting, it is difficult to develop an effective multilinear PCA method with the orthogonality constraint. This paper tackles this problem by proposing a novel Semi-Orthogonal Multilinear PCA (SO-MPCA) approach. SO-MPCA learns low-dimensional features directly from tensors via TVP by imposing the orthogonality constraint in only one mode. This formulation results in more captured variance and more learned features than full orthogonality. For better generalization, we further introduce a relaxed start (RS) strategy to get SO-MPCA-RS by fixing the starting projection vectors, which increases the bias and reduces the variance of the learning model. Experiments on both face (2D) and gait (3D) data demonstrate that SO-MPCA-RS outperforms other competing algorithms on the whole, and the relaxed start strategy is also effective for other TVP-based PCA methods.

Submitted 6 May, 2015; v1 submitted 30 April, 2015; originally announced April 2015.

Comments: 8 pages, 2 figures, to appear in Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015)

ACM Class: I.2.6

arXiv:1410.3561 [pdf, ps, other]

Sufficient dimension reduction with additional information

Authors: Hung Hung, Chih-Yen Liu, Henry Horng-Shing Lu

Abstract: Sufficient dimension reduction is widely applied to help model building between the response $Y$ and covariate $X$. While the target of interest is the relationship between $(Y,X)$, in some applications we also collect additional variable $W$ that is strongly correlated with $Y$. From a statistical point of view, making inference about $(Y,X)$ without using $W$ will lose efficiency. However, it is not trivial to incorporate the information of $W$ to infer $(Y,X)$. In this article, we propose a two-stage dimension reduction method for $(Y,X)$, that is able to utilize the additional information from $W$. The main idea is to confine the searching space, by constructing an envelope subspace for the target of interest. In the analysis of breast cancer data, the risk score constructed from the two-stage method can well separate patients with different survival experiences. In the Pima data, the two-stage method requires fewer components to infer the diabetes status, while achieving higher classification accuracy than conventional method.

Submitted 13 October, 2014; originally announced October 2014.

Comments: 26 pages, 4 figures, 1 table

arXiv:1408.5352 [pdf, other]

Nonconvex Statistical Optimization: Minimax-Optimal Sparse PCA in Polynomial Time

Authors: Zhaoran Wang, Huanran Lu, Han Liu

Abstract: Sparse principal component analysis (PCA) involves nonconvex optimization for which the global solution is hard to obtain. To address this issue, one popular approach is convex relaxation. However, such an approach may produce suboptimal estimators due to the relaxation effect. To optimally estimate sparse principal subspaces, we propose a two-stage computational framework named "tighten after relax": Within the 'relax' stage, we approximately solve a convex relaxation of sparse PCA with early stopping to obtain a desired initial estimator; For the 'tighten' stage, we propose a novel algorithm called sparse orthogonal iteration pursuit (SOAP), which iteratively refines the initial estimator by directly solving the underlying nonconvex problem. A key concept of this two-stage framework is the basin of attraction. It represents a local region within which the 'tighten' stage has desired computational and statistical guarantees. We prove that, the initial estimator obtained from the 'relax' stage falls into such a region, and hence SOAP geometrically converges to a principal subspace estimator which is minimax-optimal within a certain model class. Unlike most existing sparse PCA estimators, our approach applies to the non-spiked covariance models, and adapts to non-Gaussianity as well as dependent data settings. Moreover, through analyzing the computational complexity of the two stages, we illustrate an interesting phenomenon that larger sample size can reduce the total iteration complexity.
Our framework motivates a general paradigm for solving many complex statistical problems which involve nonconvex optimization with provable guarantees.

Submitted 22 August, 2014; originally announced August 2014.

Comments: 64 pages, 8 figures

arXiv:1307.0293 [pdf, other]

A Direct Estimation of High Dimensional Stationary Vector Autoregressions

Authors: Fang Han, Huanran Lu, Han Liu

Abstract: The vector autoregressive (VAR) model is a powerful tool in modeling complex time series and has been exploited in many fields. However, fitting high dimensional VAR model poses some unique challenges: On one hand, the dimensionality, caused by modeling a large number of time series and higher order autoregressive processes, is usually much higher than the time series length; On the other hand, the temporal dependence structure in the VAR model gives rise to extra theoretical challenges. In high dimensions, one popular approach is to assume the transition matrix is sparse and fit the VAR model using the "least squares" method with a lasso-type penalty. In this manuscript, we propose an alternative way in estimating the VAR model. The main idea is, via exploiting the temporal dependence structure, to formulate the estimating problem into a linear program. There is instant advantage for the proposed approach over the lasso-type estimators: The estimation equation can be decomposed into multiple sub-equations and accordingly can be efficiently solved in a parallel fashion. In addition, our method brings new theoretical insights into the VAR model analysis. So far the theoretical results developed in high dimensions (e.g., Song and Bickel (2011) and Kock and Callot (2012)) mainly pose assumptions on the design matrix of the formulated regression problems. Such conditions are indirect about the transition matrices and not transparent.
In contrast, our results show that the operator norm of the transition matrices plays an important role in estimation accuracy. We provide explicit rates of convergence for both estimation and prediction. In addition, we provide thorough experiments on both synthetic and real-world equity data to show that there are empirical advantages of our method over the lasso-type estimators in both parameter estimation and forecasting.

Submitted 28 October, 2014; v1 submitted 1 July, 2013; originally announced July 2013.

Comments: 36 pages, 3 figures

Showing 1–50 of 50 results for author: Lue, H