Search | arXiv e-print repository

Effective climate policies for major emission reductions of ozone precursors: Global evidence from two decades

Authors: Ningning Yao, Huan Xi, Lang Chen, Zhe Song, Jian Li, Yulei Chen, Baocai Guo, Yuanhang Zhang, Tong Zhu, Pengfei Li, Daniel Rosenfeld, John H. Seinfeld, Shaocai Yu

Abstract: Despite policymakers deploying various tools to mitigate emissions of ozone (O\textsubscript{3}) precursors, such as nitrogen oxides (NO\textsubscript{x}), carbon monoxide (CO), and volatile organic compounds (VOCs), the effectiveness of policy combinations remains uncertain. We employ an integrated framework that couples structural break detection with machine learning to pinpoint effective inter… ▽ More Despite policymakers deploying various tools to mitigate emissions of ozone (O\textsubscript{3}) precursors, such as nitrogen oxides (NO\textsubscript{x}), carbon monoxide (CO), and volatile organic compounds (VOCs), the effectiveness of policy combinations remains uncertain. We employ an integrated framework that couples structural break detection with machine learning to pinpoint effective interventions across the building, electricity, industrial, and transport sectors, identifying treatment effects as abrupt changes without prior assumptions about policy treatment assignment and timing. Applied to two decades of global O\textsubscript{3} precursor emissions data, we detect 78, 77, and 78 structural breaks for NO\textsubscript{x}, CO, and VOCs, corresponding to cumulative emission reductions of 0.96-0.97 Gt, 2.84-2.88 Gt, and 0.47-0.48 Gt, respectively. Sector-level analysis shows that electricity sector structural policies cut NO\textsubscript{x} by up to 32.4\%, while in buildings, developed countries combined adoption subsidies with carbon taxes to achieve 42.7\% CO reductions and developing countries used financing plus fuel taxes to secure 52.3\%. VOCs abatement peaked at 38.5\% when fossil-fuel subsidy reforms were paired with financial incentives. Finally, hybrid strategies merging non-price measures (subsidies, bans, mandates) with pricing instruments delivered up to an additional 10\% co-benefit. These findings guide the sequencing and complementarity of context-specific policy portfolios for O\textsubscript{3} precursor mitigation. △ Less

Submitted 19 May, 2025; originally announced May 2025.

Comments: There are 30 pages of 12 figures

arXiv:2505.12444 [pdf, other]

High-Dimensional Dynamic Covariance Models with Random Forests

Authors: Shuguang Yu, Fan Zhou, Yingjie Zhang, Ziqi Chen, Hongtu Zhu

Abstract: This paper introduces a novel nonparametric method for estimating high-dimensional dynamic covariance matrices with multiple conditioning covariates, leveraging random forests and supported by robust theoretical guarantees. Unlike traditional static methods, our dynamic nonparametric covariance models effectively capture distributional heterogeneity. Furthermore, unlike kernel-smoothing methods, w… ▽ More This paper introduces a novel nonparametric method for estimating high-dimensional dynamic covariance matrices with multiple conditioning covariates, leveraging random forests and supported by robust theoretical guarantees. Unlike traditional static methods, our dynamic nonparametric covariance models effectively capture distributional heterogeneity. Furthermore, unlike kernel-smoothing methods, which are restricted to a single conditioning covariate, our approach accommodates multiple covariates in a fully nonparametric framework. To the best of our knowledge, this is the first method to use random forests for estimating high-dimensional dynamic covariance matrices. In high-dimensional settings, we establish uniform consistency theory, providing nonasymptotic error rates and model selection properties, even when the response dimension grows sub-exponentially with the sample size. These results hold uniformly across a range of conditioning variables. The method's effectiveness is demonstrated through simulations and a stock dataset analysis, highlighting its ability to model complex dynamics in high-dimensional scenarios. △ Less

Submitted 18 May, 2025; originally announced May 2025.

arXiv:2505.12225 [pdf, other]

Reward Inside the Model: A Lightweight Hidden-State Reward Model for LLM's Best-of-N sampling

Authors: Jizhou Guo, Zhaomin Wu, Philip S. Yu

Abstract: High-quality reward models are crucial for unlocking the reasoning potential of large language models (LLMs), with best-of-N voting demonstrating significant performance gains. However, current reward models, which typically operate on the textual output of LLMs, are computationally expensive and parameter-heavy, limiting their real-world applications. We introduce the Efficient Linear Hidden Stat… ▽ More High-quality reward models are crucial for unlocking the reasoning potential of large language models (LLMs), with best-of-N voting demonstrating significant performance gains. However, current reward models, which typically operate on the textual output of LLMs, are computationally expensive and parameter-heavy, limiting their real-world applications. We introduce the Efficient Linear Hidden State Reward (ELHSR) model - a novel, highly parameter-efficient approach that leverages the rich information embedded in LLM hidden states to address these issues. ELHSR systematically outperform baselines with less than 0.005% of the parameters of baselines, requiring only a few samples for training. ELHSR also achieves orders-of-magnitude efficiency improvement with significantly less time and fewer FLOPs per sample than baseline reward models. Moreover, ELHSR exhibits robust performance even when trained only on logits, extending its applicability to some closed-source LLMs. In addition, ELHSR can also be combined with traditional reward models to achieve additional performance gains. △ Less

Submitted 18 May, 2025; originally announced May 2025.

arXiv:2503.04981 [pdf, other]

Topology-Aware Conformal Prediction for Stream Networks

Authors: Jifan Zhang, Fangxin Wang, Philip S. Yu, Kaize Ding, Shixiang Zhu

Abstract: Stream networks, a unique class of spatiotemporal graphs, exhibit complex directional flow constraints and evolving dependencies, making uncertainty quantification a critical yet challenging task. Traditional conformal prediction methods struggle in this setting due to the need for joint predictions across multiple interdependent locations and the intricate spatio-temporal dependencies inherent in… ▽ More Stream networks, a unique class of spatiotemporal graphs, exhibit complex directional flow constraints and evolving dependencies, making uncertainty quantification a critical yet challenging task. Traditional conformal prediction methods struggle in this setting due to the need for joint predictions across multiple interdependent locations and the intricate spatio-temporal dependencies inherent in stream networks. Existing approaches either neglect dependencies, leading to overly conservative predictions, or rely solely on data-driven estimations, failing to capture the rich topological structure of the network. To address these challenges, we propose Spatio-Temporal Adaptive Conformal Inference (\texttt{STACI}), a novel framework that integrates network topology and temporal dynamics into the conformal prediction framework. \texttt{STACI} introduces a topology-aware nonconformity score that respects directional flow constraints and dynamically adjusts prediction sets to account for temporal distributional shifts. We provide theoretical guarantees on the validity of our approach and demonstrate its superior performance on both synthetic and real-world datasets. Our results show that \texttt{STACI} effectively balances prediction efficiency and coverage, outperforming existing conformal prediction methods for stream networks. △ Less

Submitted 6 March, 2025; originally announced March 2025.

Comments: 16 pages, 6 figures

arXiv:2502.05155 [pdf, other]

Deep Dynamic Probabilistic Canonical Correlation Analysis

Authors: Shiqin Tang, Shujian Yu, Yining Dong, S. Joe Qin

Abstract: This paper presents Deep Dynamic Probabilistic Canonical Correlation Analysis (D2PCCA), a model that integrates deep learning with probabilistic modeling to analyze nonlinear dynamical systems. Building on the probabilistic extensions of Canonical Correlation Analysis (CCA), D2PCCA captures nonlinear latent dynamics and supports enhancements such as KL annealing for improved convergence and normal… ▽ More This paper presents Deep Dynamic Probabilistic Canonical Correlation Analysis (D2PCCA), a model that integrates deep learning with probabilistic modeling to analyze nonlinear dynamical systems. Building on the probabilistic extensions of Canonical Correlation Analysis (CCA), D2PCCA captures nonlinear latent dynamics and supports enhancements such as KL annealing for improved convergence and normalizing flows for a more flexible posterior approximation. D2PCCA naturally extends to multiple observed variables, making it a versatile tool for encoding prior knowledge about sequential datasets and providing a probabilistic understanding of the system's dynamics. Experimental validation on real financial datasets demonstrates the effectiveness of D2PCCA and its extensions in capturing latent dynamics. △ Less

Submitted 7 February, 2025; originally announced February 2025.

Comments: accepted by ICASSP-25, code is available at \url{https://github.com/marcusstang/D2PCCA}

arXiv:2501.16768 [pdf, other]

Towards the Generalization of Multi-view Learning: An Information-theoretical Analysis

Authors: Wen Wen, Tieliang Gong, Yuxin Dong, Shujian Yu, Weizhan Zhang

Abstract: Multiview learning has drawn widespread attention for its efficacy in leveraging cross-view consensus and complementarity information to achieve a comprehensive representation of data. While multi-view learning has undergone vigorous development and achieved remarkable success, the theoretical understanding of its generalization behavior remains elusive. This paper aims to bridge this gap by devel… ▽ More Multiview learning has drawn widespread attention for its efficacy in leveraging cross-view consensus and complementarity information to achieve a comprehensive representation of data. While multi-view learning has undergone vigorous development and achieved remarkable success, the theoretical understanding of its generalization behavior remains elusive. This paper aims to bridge this gap by developing information-theoretic generalization bounds for multi-view learning, with a particular focus on multi-view reconstruction and classification tasks. Our bounds underscore the importance of capturing both consensus and complementary information from multiple different views to achieve maximally disentangled representations. These results also indicate that applying the multi-view information bottleneck regularizer is beneficial for satisfactory generalization performance. Additionally, we derive novel data-dependent bounds under both leave-one-out and supersample settings, yielding computational tractable and tighter bounds. In the interpolating regime, we further establish the fast-rate bound for multi-view learning, exhibiting a faster convergence rate compared to conventional square-root bounds. Numerical results indicate a strong correlation between the true generalization gap and the derived bounds across various learning scenarios. △ Less

Submitted 28 January, 2025; originally announced January 2025.

arXiv:2412.07446 [pdf, ps, other]

A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment

Authors: Raanan Y. Rohekar, Yaniv Gurwicz, Sungduk Yu, Estelle Aflalo, Vasudev Lal

Abstract: Are generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learning a world model from which sequences are generated one token at a time? We address this question by deriving a causal interpretation of the attention mechanism in GPT and presenting a causal world model that arises from this interpretation. Furthermore, we propose that GPT models, at infe… ▽ More Are generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learning a world model from which sequences are generated one token at a time? We address this question by deriving a causal interpretation of the attention mechanism in GPT and presenting a causal world model that arises from this interpretation. Furthermore, we propose that GPT models, at inference time, can be utilized for zero-shot causal structure learning for input sequences, and introduce a corresponding confidence score. Empirical tests were conducted in controlled environments using the setups of the Othello and Chess strategy games. A GPT, pre-trained on real-world games played with the intention of winning, was tested on out-of-distribution synthetic data consisting of sequences of random legal moves. We find that the GPT model is likely to generate legal next moves for out-of-distribution sequences for which a causal structure is encoded in the attention mechanism with high confidence. In cases where it generates illegal moves, it also fails to capture a causal structure. △ Less

Submitted 6 July, 2025; v1 submitted 10 December, 2024; originally announced December 2024.

Comments: International Conference on Machine Learning (ICML), 2025

arXiv:2412.05783 [pdf, other]

Two-way Deconfounder for Off-policy Evaluation in Causal Reinforcement Learning

Authors: Shuguang Yu, Shuxing Fang, Ruixin Peng, Zhengling Qi, Fan Zhou, Chengchun Shi

Abstract: This paper studies off-policy evaluation (OPE) in the presence of unmeasured confounders. Inspired by the two-way fixed effects regression model widely used in the panel data literature, we propose a two-way unmeasured confounding assumption to model the system dynamics in causal reinforcement learning and develop a two-way deconfounder algorithm that devises a neural tensor network to simultaneou… ▽ More This paper studies off-policy evaluation (OPE) in the presence of unmeasured confounders. Inspired by the two-way fixed effects regression model widely used in the panel data literature, we propose a two-way unmeasured confounding assumption to model the system dynamics in causal reinforcement learning and develop a two-way deconfounder algorithm that devises a neural tensor network to simultaneously learn both the unmeasured confounders and the system dynamics, based on which a model-based estimator can be constructed for consistent policy value estimation. We illustrate the effectiveness of the proposed estimator through theoretical results and numerical experiments. △ Less

Submitted 7 December, 2024; originally announced December 2024.

arXiv:2411.00109 [pdf, other]

Prospective Learning: Learning for a Dynamic Future

Authors: Ashwin De Silva, Rahul Ramesh, Rubing Yang, Siyu Yu, Joshua T Vogelstein, Pratik Chaudhari

Abstract: In real-world applications, the distribution of the data, and our goals, evolve over time. The prevailing theoretical framework for studying machine learning, namely probably approximately correct (PAC) learning, largely ignores time. As a consequence, existing strategies to address the dynamic nature of data and goals exhibit poor real-world performance. This paper develops a theoretical framewor… ▽ More In real-world applications, the distribution of the data, and our goals, evolve over time. The prevailing theoretical framework for studying machine learning, namely probably approximately correct (PAC) learning, largely ignores time. As a consequence, existing strategies to address the dynamic nature of data and goals exhibit poor real-world performance. This paper develops a theoretical framework called "Prospective Learning" that is tailored for situations when the optimal hypothesis changes over time. In PAC learning, empirical risk minimization (ERM) is known to be consistent. We develop a learner called Prospective ERM, which returns a sequence of predictors that make predictions on future data. We prove that the risk of prospective ERM converges to the Bayes risk under certain assumptions on the stochastic process generating the data. Prospective ERM, roughly speaking, incorporates time as an input in addition to the data. We show that standard ERM as done in PAC learning, without incorporating time, can result in failure to learn when distributions are dynamic. Numerical experiments illustrate that prospective ERM can learn synthetic and visual recognition problems constructed from MNIST and CIFAR-10. Code at https://github.com/neurodata/prolearn. △ Less

Submitted 30 January, 2025; v1 submitted 31 October, 2024; originally announced November 2024.

Comments: Accepted to NeurIPS 2024

arXiv:2408.16963 [pdf, other]

Nonparametric Density Estimation for Data Scattered on Irregular Spatial Domains: A Likelihood-Based Approach Using Bivariate Penalized Spline Smoothing

Authors: Kunal Das, Shan Yu, Guannan Wang, Li Wang

Abstract: Accurately estimating data density is crucial for making informed decisions and modeling in various fields. This paper presents a novel nonparametric density estimation procedure that utilizes bivariate penalized spline smoothing over triangulation for data scattered over irregular spatial domains. The approach is likelihood-based with a regularization term that addresses the roughness of the loga… ▽ More Accurately estimating data density is crucial for making informed decisions and modeling in various fields. This paper presents a novel nonparametric density estimation procedure that utilizes bivariate penalized spline smoothing over triangulation for data scattered over irregular spatial domains. The approach is likelihood-based with a regularization term that addresses the roughness of the logarithm of density based on a second-order differential operator. The proposed method offers greater efficiency and flexibility in estimating density over complex domains and has been theoretically supported by establishing the asymptotic convergence rate under mild natural conditions. Through extensive simulation studies and a real-world application that analyzes motor vehicle theft data from Portland City, Oregon, we demonstrate the advantages of the proposed method over existing techniques detailed in the literature. △ Less

Submitted 26 October, 2024; v1 submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.01861 [pdf, other]

Batch Active Learning in Gaussian Process Regression using Derivatives

Authors: Hon Sum Alec Yu, Christoph Zimmer, Duy Nguyen-Tuong

Abstract: We investigate the use of derivative information for Batch Active Learning in Gaussian Process regression models. The proposed approach employs the predictive covariance matrix for selection of data batches to exploit full correlation of samples. We theoretically analyse our proposed algorithm taking different optimality criteria into consideration and provide empirical comparisons highlighting th… ▽ More We investigate the use of derivative information for Batch Active Learning in Gaussian Process regression models. The proposed approach employs the predictive covariance matrix for selection of data batches to exploit full correlation of samples. We theoretically analyse our proposed algorithm taking different optimality criteria into consideration and provide empirical comparisons highlighting the advantage of incorporating derivatives information. Our results show the effectiveness of our approach across diverse applications. △ Less

Submitted 3 August, 2024; originally announced August 2024.

Comments: 29 pages, 10 figures

arXiv:2406.04690 [pdf, other]

Higher-order Structure Based Anomaly Detection on Attributed Networks

Authors: Xu Yuan, Na Zhou, Shuo Yu, Huafei Huang, Zhikui Chen, Feng Xia

Abstract: Anomaly detection (such as telecom fraud detection and medical image detection) has attracted the increasing attention of people. The complex interaction between multiple entities widely exists in the network, which can reflect specific human behavior patterns. Such patterns can be modeled by higher-order network structures, thus benefiting anomaly detection on attributed networks. However, due to… ▽ More Anomaly detection (such as telecom fraud detection and medical image detection) has attracted the increasing attention of people. The complex interaction between multiple entities widely exists in the network, which can reflect specific human behavior patterns. Such patterns can be modeled by higher-order network structures, thus benefiting anomaly detection on attributed networks. However, due to the lack of an effective mechanism in most existing graph learning methods, these complex interaction patterns fail to be applied in detecting anomalies, hindering the progress of anomaly detection to some extent. In order to address the aforementioned issue, we present a higher-order structure based anomaly detection (GUIDE) method. We exploit attribute autoencoder and structure autoencoder to reconstruct node attributes and higher-order structures, respectively. Moreover, we design a graph attention layer to evaluate the significance of neighbors to nodes through their higher-order structure differences. Finally, we leverage node attribute and higher-order structure reconstruction errors to find anomalies. Extensive experiments on five real-world datasets (i.e., ACM, Citation, Cora, DBLP, and Pubmed) are implemented to verify the effectiveness of GUIDE. Experimental results in terms of ROC-AUC, PR-AUC, and Recall@K show that GUIDE significantly outperforms the state-of-art methods. △ Less

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2405.19978 [pdf, other]

Domain Adaptation with Cauchy-Schwarz Divergence

Authors: Wenzhe Yin, Shujian Yu, Yicong Lin, Jie Liu, Jan-Jakob Sonke, Efstratios Gavves

Abstract: Domain adaptation aims to use training data from one or multiple source domains to learn a hypothesis that can be generalized to a different, but related, target domain. As such, having a reliable measure for evaluating the discrepancy of both marginal and conditional distributions is crucial. We introduce Cauchy-Schwarz (CS) divergence to the problem of unsupervised domain adaptation (UDA). The C… ▽ More Domain adaptation aims to use training data from one or multiple source domains to learn a hypothesis that can be generalized to a different, but related, target domain. As such, having a reliable measure for evaluating the discrepancy of both marginal and conditional distributions is crucial. We introduce Cauchy-Schwarz (CS) divergence to the problem of unsupervised domain adaptation (UDA). The CS divergence offers a theoretically tighter generalization error bound than the popular Kullback-Leibler divergence. This holds for the general case of supervised learning, including multi-class classification and regression. Furthermore, we illustrate that the CS divergence enables a simple estimator on the discrepancy of both marginal and conditional distributions between source and target domains in the representation space, without requiring any distributional assumptions. We provide multiple examples to illustrate how the CS divergence can be conveniently used in both distance metric- or adversarial training-based UDA frameworks, resulting in compelling performance. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: Accepted by UAI-24

arXiv:2405.13535 [pdf, ps, other]

Addressing the Inconsistency in Bayesian Deep Learning via Generalized Laplace Approximation

Authors: Yinsong Chen, Samson S. Yu, Zhong Li, Chee Peng Lim

Abstract: In recent years, inconsistency in Bayesian deep learning has attracted significant attention. Tempered or generalized posterior distributions are frequently employed as direct and effective solutions. Nonetheless, the underlying mechanisms and the effectiveness of generalized posteriors remain active research topics. In this work, we interpret posterior tempering as a correction for model misspeci… ▽ More In recent years, inconsistency in Bayesian deep learning has attracted significant attention. Tempered or generalized posterior distributions are frequently employed as direct and effective solutions. Nonetheless, the underlying mechanisms and the effectiveness of generalized posteriors remain active research topics. In this work, we interpret posterior tempering as a correction for model misspecification via adjustments to the joint probability, and as a recalibration of priors by reducing aleatoric uncertainty. We also identify a unique property of the Laplace approximation: the generalized normalizing constant remains invariant, in contrast to general Bayesian learning, where this constant typically depends on model parameters after generalization. Leveraging this property, we introduce the generalized Laplace approximation, which requires only a simple modification to the Hessian calculation of the regularized loss. This approach provides a flexible and scalable framework for high-quality posterior inference. We evaluate the proposed method on state-of-the-art neural networks and real-world datasets, demonstrating that the generalized Laplace approximation enhances predictive performance. △ Less

Submitted 30 June, 2025; v1 submitted 22 May, 2024; originally announced May 2024.

arXiv:2404.17951 [pdf, other]

Cauchy-Schwarz Divergence Information Bottleneck for Regression

Authors: Shujian Yu, Xi Yu, Sigurd Løkse, Robert Jenssen, Jose C. Principe

Abstract: The information bottleneck (IB) approach is popular to improve the generalization, robustness and explainability of deep neural networks. Essentially, it aims to find a minimum sufficient representation $\mathbf{t}$ by striking a trade-off between a compression term $I(\mathbf{x};\mathbf{t})$ and a prediction term $I(y;\mathbf{t})$, where $I(\cdot;\cdot)$ refers to the mutual information (MI). MI… ▽ More The information bottleneck (IB) approach is popular to improve the generalization, robustness and explainability of deep neural networks. Essentially, it aims to find a minimum sufficient representation $\mathbf{t}$ by striking a trade-off between a compression term $I(\mathbf{x};\mathbf{t})$ and a prediction term $I(y;\mathbf{t})$, where $I(\cdot;\cdot)$ refers to the mutual information (MI). MI is for the IB for the most part expressed in terms of the Kullback-Leibler (KL) divergence, which in the regression case corresponds to prediction based on mean squared error (MSE) loss with Gaussian assumption and compression approximated by variational inference. In this paper, we study the IB principle for the regression problem and develop a new way to parameterize the IB with deep neural networks by exploiting favorable properties of the Cauchy-Schwarz (CS) divergence. By doing so, we move away from MSE-based regression and ease estimation by avoiding variational approximations or distributional assumptions. We investigate the improved generalization ability of our proposed CS-IB and demonstrate strong adversarial robustness guarantees. We demonstrate its superior performance on six real-world regression tasks over other popular deep IB approaches. We additionally observe that the solutions discovered by CS-IB always achieve the best trade-off between prediction accuracy and compression ratio in the information plane. The code is available at \url{https://github.com/SJYuCNEL/Cauchy-Schwarz-Information-Bottleneck}. △ Less

Submitted 27 April, 2024; originally announced April 2024.

Comments: accepted by ICLR-24, project page: \url{https://github.com/SJYuCNEL/Cauchy-Schwarz-Information-Bottleneck}

arXiv:2404.11579 [pdf, other]

Spatial Heterogeneous Additive Partial Linear Model: A Joint Approach of Bivariate Spline and Forest Lasso

Authors: Xin Zhang, Shan Yu, Zhengyuan Zhu, Xin Wang

Abstract: Identifying spatial heterogeneous patterns has attracted a surge of research interest in recent years, due to its important applications in various scientific and engineering fields. In practice the spatially heterogeneous components are often mixed with components which are spatially smooth, making the task of identifying the heterogeneous regions more challenging. In this paper, we develop an ef… ▽ More Identifying spatial heterogeneous patterns has attracted a surge of research interest in recent years, due to its important applications in various scientific and engineering fields. In practice the spatially heterogeneous components are often mixed with components which are spatially smooth, making the task of identifying the heterogeneous regions more challenging. In this paper, we develop an efficient clustering approach to identify the model heterogeneity of the spatial additive partial linear model. Specifically, we aim to detect the spatially contiguous clusters based on the regression coefficients while introducing a spatially varying intercept to deal with the smooth spatial effect. On the one hand, to approximate the spatial varying intercept, we use the method of bivariate spline over triangulation, which can effectively handle the data from a complex domain. On the other hand, a novel fusion penalty termed the forest lasso is proposed to reveal the spatial clustering pattern. Our proposed fusion penalty has advantages in both the estimation and computation efficiencies when dealing with large spatial data. Theoretically properties of our estimator are established, and simulation results show that our approach can achieve more accurate estimation with a limited computation cost compared with the existing approaches. To illustrate its practical use, we apply our approach to analyze the spatial pattern of the relationship between land surface temperature measured by satellites and air temperature measured by ground stations in the United States. △ Less

Submitted 3 May, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

arXiv:2403.07185 [pdf, other]

Uncertainty in Graph Neural Networks: A Survey

Authors: Fangxin Wang, Yuqing Liu, Kay Liu, Yibo Wang, Sourav Medya, Philip S. Yu

Abstract: Graph Neural Networks (GNNs) have been extensively used in various real-world applications. However, the predictive uncertainty of GNNs stemming from diverse sources such as inherent randomness in data and model training errors can lead to unstable and erroneous predictions. Therefore, identifying, quantifying, and utilizing uncertainty are essential to enhance the performance of the model for the… ▽ More Graph Neural Networks (GNNs) have been extensively used in various real-world applications. However, the predictive uncertainty of GNNs stemming from diverse sources such as inherent randomness in data and model training errors can lead to unstable and erroneous predictions. Therefore, identifying, quantifying, and utilizing uncertainty are essential to enhance the performance of the model for the downstream tasks as well as the reliability of the GNN predictions. This survey aims to provide a comprehensive overview of the GNNs from the perspective of uncertainty with an emphasis on its integration in graph learning. We compare and summarize existing graph uncertainty theory and methods, alongside the corresponding downstream tasks. Thereby, we bridge the gap between theory and practice, meanwhile connecting different GNN communities. Moreover, our work provides valuable insights into promising directions in this field. △ Less

Submitted 8 March, 2025; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: 14 main pages, 4 figures, 1 table

Journal ref: Transactions on Machine Learning Research (11/2024)

arXiv:2403.06302 [pdf, other]

Nonparametric Automatic Differentiation Variational Inference with Spline Approximation

Authors: Yuda Shao, Shan Yu, Tianshu Feng

Abstract: Automatic Differentiation Variational Inference (ADVI) is efficient in learning probabilistic models. Classic ADVI relies on the parametric approach to approximate the posterior. In this paper, we develop a spline-based nonparametric approximation approach that enables flexible posterior approximation for distributions with complicated structures, such as skewness, multimodality, and bounded suppo… ▽ More Automatic Differentiation Variational Inference (ADVI) is efficient in learning probabilistic models. Classic ADVI relies on the parametric approach to approximate the posterior. In this paper, we develop a spline-based nonparametric approximation approach that enables flexible posterior approximation for distributions with complicated structures, such as skewness, multimodality, and bounded support. Compared with widely-used nonparametric variational inference methods, the proposed method is easy to implement and adaptive to various data structures. By adopting the spline approximation, we derive a lower bound of the importance weighted autoencoder and establish the asymptotic consistency. Experiments demonstrate the efficiency of the proposed method in approximating complex posterior distributions and improving the performance of generative models with incomplete data. △ Less

Submitted 10 March, 2024; originally announced March 2024.

arXiv:2403.05644 [pdf, other]

TSSS: A Novel Triangulated Spherical Spline Smoothing for Surface-based Data

Authors: Zhiling Gu, Shan Yu, Guannan Wang, Ming-Jun Lai, Li Wang

Abstract: Surface-based data is commonly observed in diverse practical applications spanning various fields. In this paper, we introduce a novel nonparametric method to discover the underlying signals from data distributed on complex surface-based domains. Our approach involves a penalized spline estimator defined on a triangulation of surface patches, which enables effective signal extraction and recovery.… ▽ More Surface-based data is commonly observed in diverse practical applications spanning various fields. In this paper, we introduce a novel nonparametric method to discover the underlying signals from data distributed on complex surface-based domains. Our approach involves a penalized spline estimator defined on a triangulation of surface patches, which enables effective signal extraction and recovery. The proposed method offers several advantages over existing methods, including superior handling of "leakage" or "boundary effects" over complex domains, enhanced computational efficiency, and potential applications in analyzing sparse and irregularly distributed data on complex objects. We provide rigorous theoretical guarantees for the proposed method, including convergence rates of the estimator in both the $L_2$ and supremum norms, as well as the asymptotic normality of the estimator. We also demonstrate that the convergence rates achieved by our estimation method are optimal within the framework of nonparametric estimation. Furthermore, we introduce a bootstrap method to quantify the uncertainty associated with the proposed estimators accurately. The superior performance of the proposed method is demonstrated through simulation experiments and data applications on cortical surface functional magnetic resonance imaging data and oceanic near-surface atmospheric data. △ Less

Submitted 8 March, 2024; originally announced March 2024.

Comments: 56 pages, 16 figures

MSC Class: 62G05; 62G08

arXiv:2401.09028 [pdf, other]

A Novel Interpretable Fusion Analytic Framework for Investigating Functional Brain Connectivity Differences in Cognitive Impairments

Authors: Yeseul Jeon, Jeong-Jae Kim, SuMin Yu, Junggu Choi, Sanghoon Han

Abstract: Functional magnetic resonance imaging (fMRI) data is characterized by its complexity and high--dimensionality, encompassing signals from various regions of interests (ROIs) that exhibit intricate correlations. Analyzing fMRI data directly proves challenging due to its intricate structure. Nevertheless, ROIs convey crucial information about brain activities through their connections, offering insig… ▽ More Functional magnetic resonance imaging (fMRI) data is characterized by its complexity and high--dimensionality, encompassing signals from various regions of interests (ROIs) that exhibit intricate correlations. Analyzing fMRI data directly proves challenging due to its intricate structure. Nevertheless, ROIs convey crucial information about brain activities through their connections, offering insights into distinctive brain activity characteristics between different groups. To address this, we propose a cutting-edge interpretable fusion analytic framework that facilitates the identification and understanding of ROI connectivity disparities between two groups, thereby revealing their unique features. Our novel approach encompasses three key steps. Firstly, we construct ROI functional connectivity networks (FCNs) to effectively manage fMRI data. Secondly, employing the FCNs, we utilize a self--attention deep learning model for binary classification, generating an attention distribution that encodes group differences. Lastly, we employ a latent space item-response model to extract group representative ROI features, visualizing these features on the group summary FCNs. We validate the effectiveness of our framework by analyzing four types of cognitive impairments, showcasing its capability to identify significant ROIs contributing to the differences between the two disease groups. This novel interpretable fusion analytic framework holds immense potential for advancing our understanding of cognitive impairments and could pave the way for more targeted therapeutic interventions. △ Less

Submitted 17 January, 2024; originally announced January 2024.

Comments: arXiv admin note: text overlap with arXiv:2207.01581

arXiv:2311.05339 [pdf, other]

An iterative algorithm for high-dimensional linear models with both sparse and non-sparse structures

Authors: Shun Yu, Yuehan Yang

Abstract: Numerous practical medical problems often involve data that possess a combination of both sparse and non-sparse structures. Traditional penalized regularizations techniques, primarily designed for promoting sparsity, are inadequate to capture the optimal solutions in such scenarios. To address these challenges, this paper introduces a novel algorithm named Non-sparse Iteration (NSI). The NSI algor… ▽ More Numerous practical medical problems often involve data that possess a combination of both sparse and non-sparse structures. Traditional penalized regularizations techniques, primarily designed for promoting sparsity, are inadequate to capture the optimal solutions in such scenarios. To address these challenges, this paper introduces a novel algorithm named Non-sparse Iteration (NSI). The NSI algorithm allows for the existence of both sparse and non-sparse structures and estimates them simultaneously and accurately. We provide theoretical guarantees that the proposed algorithm converges to the oracle solution and achieves the optimal rate for the upper bound of the $l_2$-norm error. Through simulations and practical applications, NSI consistently exhibits superior statistical performance in terms of estimation accuracy, prediction efficacy, and variable selection compared to several existing methods. The proposed method is also applied to breast cancer data, revealing repeated selection of specific genes for in-depth analysis. △ Less

Submitted 9 November, 2023; originally announced November 2023.

arXiv:2306.15626 [pdf, other]

LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

Authors: Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, Anima Anandkumar

Abstract: Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. However, existing methods are difficult to reproduce or build on, due to private code, data, and large compute requirements. This has created substantial barriers to research on machine learning methods for theorem proving. This paper removes these barriers by introducing LeanDojo: an op… ▽ More Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. However, existing methods are difficult to reproduce or build on, due to private code, data, and large compute requirements. This has created substantial barriers to research on machine learning methods for theorem proving. This paper removes these barriers by introducing LeanDojo: an open-source Lean playground consisting of toolkits, data, models, and benchmarks. LeanDojo extracts data from Lean and enables interaction with the proof environment programmatically. It contains fine-grained annotations of premises in proofs, providing valuable data for premise selection: a key bottleneck in theorem proving. Using this data, we develop ReProver (Retrieval-Augmented Prover): an LLM-based prover augmented with retrieval for selecting premises from a vast math library. It is inexpensive and needs only one GPU week of training. Our retriever leverages LeanDojo's program analysis capability to identify accessible premises and hard negative examples, which makes retrieval much more effective. Furthermore, we construct a new benchmark consisting of 98,734 theorems and proofs extracted from Lean's math library. It features challenging data split requiring the prover to generalize to theorems relying on novel premises that are never used in training. We use this benchmark for training and evaluation, and experimental results demonstrate the effectiveness of ReProver over non-retrieval baselines and GPT-4. We thus provide the first set of open-source LLM-based theorem provers without any proprietary datasets and release it under a permissive MIT license to facilitate further research. △ Less

Submitted 27 October, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

Comments: Accepted to NeurIPS 2023 (Datasets and Benchmarks Track) as an oral presentation. Data, code, and models available at https://leandojo.org/

arXiv:2305.03555 [pdf, other]

Contrastive Graph Clustering in Curvature Spaces

Authors: Li Sun, Feiyang Wang, Junda Ye, Hao Peng, Philip S. Yu

Abstract: Graph clustering is a longstanding research topic, and has achieved remarkable success with the deep learning methods in recent years. Nevertheless, we observe that several important issues largely remain open. On the one hand, graph clustering from the geometric perspective is appealing but has rarely been touched before, as it lacks a promising space for geometric clustering. On the other hand,… ▽ More Graph clustering is a longstanding research topic, and has achieved remarkable success with the deep learning methods in recent years. Nevertheless, we observe that several important issues largely remain open. On the one hand, graph clustering from the geometric perspective is appealing but has rarely been touched before, as it lacks a promising space for geometric clustering. On the other hand, contrastive learning boosts the deep graph clustering but usually struggles in either graph augmentation or hard sample mining. To bridge this gap, we rethink the problem of graph clustering from geometric perspective and, to the best of our knowledge, make the first attempt to introduce a heterogeneous curvature space to graph clustering problem. Correspondingly, we present a novel end-to-end contrastive graph clustering model named CONGREGATE, addressing geometric graph clustering with Ricci curvatures. To support geometric clustering, we construct a theoretically grounded Heterogeneous Curvature Space where deep representations are generated via the product of the proposed fully Riemannian graph convolutional nets. Thereafter, we train the graph clusters by an augmentation-free reweighted contrastive approach where we pay more attention to both hard negatives and hard positives in our curvature space. Empirical results on real-world graphs show that our model outperforms the state-of-the-art competitors. △ Less

Submitted 5 May, 2023; originally announced May 2023.

Comments: Accepted by IJCAI'23

arXiv:2301.08970 [pdf, other]

The Conditional Cauchy-Schwarz Divergence with Applications to Time-Series Data and Sequential Decision Making

Authors: Shujian Yu, Hongming Li, Sigurd Løkse, Robert Jenssen, José C. Príncipe

Abstract: The Cauchy-Schwarz (CS) divergence was developed by Príncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the developed conditional CS divergence can be simply estimated by a kernel density estimator from given samples. We illustrate the advantages (e.g., rigorous faithfulness guarantee, lower compu… ▽ More The Cauchy-Schwarz (CS) divergence was developed by Príncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the developed conditional CS divergence can be simply estimated by a kernel density estimator from given samples. We illustrate the advantages (e.g., rigorous faithfulness guarantee, lower computational complexity, higher statistical power, and much more flexibility in a wide range of applications) of our conditional CS divergence over previous proposals, such as the conditional KL divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of conditional CS divergence in two machine learning tasks related to time series data and sequential inference, namely time series clustering and uncertainty-guided exploration for sequential decision making. The code of conditional CS divergence is available at https://github.com/SJYuCNEL/conditional_CS_divergence. △ Less

Submitted 16 March, 2025; v1 submitted 21 January, 2023; originally announced January 2023.

Comments: manuscript is accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. The code is available at \url{https://github.com/SJYuCNEL/conditional_CS_divergence}

arXiv:2301.01642 [pdf, other]

CI-GNN: A Granger Causality-Inspired Graph Neural Network for Interpretable Brain Network-Based Psychiatric Diagnosis

Authors: Kaizhong Zheng, Shujian Yu, Badong Chen

Abstract: There is a recent trend to leverage the power of graph neural networks (GNNs) for brain-network based psychiatric diagnosis, which,in turn, also motivates an urgent need for psychiatrists to fully understand the decision behavior of the used GNNs. However, most of the existing GNN explainers are either post-hoc in which another interpretive model needs to be created to explain a well-trained GNN,… ▽ More There is a recent trend to leverage the power of graph neural networks (GNNs) for brain-network based psychiatric diagnosis, which,in turn, also motivates an urgent need for psychiatrists to fully understand the decision behavior of the used GNNs. However, most of the existing GNN explainers are either post-hoc in which another interpretive model needs to be created to explain a well-trained GNN, or do not consider the causal relationship between the extracted explanation and the decision, such that the explanation itself contains spurious correlations and suffers from weak faithfulness. In this work, we propose a granger causality-inspired graph neural network (CI-GNN), a built-in interpretable model that is able to identify the most influential subgraph (i.e., functional connectivity within brain regions) that is causally related to the decision (e.g., major depressive disorder patients or healthy controls), without the training of an auxillary interpretive network. CI-GNN learns disentangled subgraph-level representations α and \b{eta} that encode, respectively, the causal and noncausal aspects of original graph under a graph variational autoencoder framework, regularized by a conditional mutual information (CMI) constraint. We theoretically justify the validity of the CMI regulation in capturing the causal relationship. We also empirically evaluate the performance of CI-GNN against three baseline GNNs and four state-of-the-art GNN explainers on synthetic data and three large-scale brain disease datasets. We observe that CI-GNN achieves the best performance in a wide range of metrics and provides more reliable and concise explanations which have clinical evidence.The source code and implementation details of CI-GNN are freely available at GitHub repository (https://github.com/ZKZ-Brain/CI-GNN/). △ Less

Submitted 28 January, 2024; v1 submitted 4 January, 2023; originally announced January 2023.

Comments: Manuscript ia accepted by Neural Networks, The source code and implementation details are freely available at GitHub repository (https://github.com/ZKZ-Brain/CI-GNN/). 45 pages, 14 figures

arXiv:2209.12313 [pdf, other]

Random graph matching at Otter's threshold via counting chandeliers

Authors: Cheng Mao, Yihong Wu, Jiaming Xu, Sophie H. Yu

Abstract: We propose an efficient algorithm for graph matching based on similarity scores constructed from counting a certain family of weighted trees rooted at each vertex. For two Erdős-Rényi graphs $\mathcal{G}(n,q)$ whose edges are correlated through a latent vertex correspondence, we show that this algorithm correctly matches all but a vanishing fraction of the vertices with high probability, provided… ▽ More We propose an efficient algorithm for graph matching based on similarity scores constructed from counting a certain family of weighted trees rooted at each vertex. For two Erdős-Rényi graphs $\mathcal{G}(n,q)$ whose edges are correlated through a latent vertex correspondence, we show that this algorithm correctly matches all but a vanishing fraction of the vertices with high probability, provided that $nq\to\infty$ and the edge correlation coefficient $ρ$ satisfies $ρ^2>α\approx 0.338$, where $α$ is Otter's tree-counting constant. Moreover, this almost exact matching can be made exact under an extra condition that is information-theoretically necessary. This is the first polynomial-time graph matching algorithm that succeeds at an explicit constant correlation and applies to both sparse and dense graphs. In comparison, previous methods either require $ρ=1-o(1)$ or are restricted to sparse graphs. The crux of the algorithm is a carefully curated family of rooted trees called chandeliers, which allows effective extraction of the graph correlation from the counts of the same tree while suppressing the undesirable correlation between those of different trees. △ Less

Submitted 13 February, 2023; v1 submitted 25 September, 2022; originally announced September 2022.

arXiv:2207.03099 [pdf, other]

doi 10.1145/3289600.3290981

A State Transition Model for Mobile Notifications via Survival Analysis

Authors: Yiping Yuan, Jing Zhang, Shaunak Chatterjee, Shipeng Yu, Romer Rosales

Abstract: Mobile notifications have become a major communication channel for social networking services to keep users informed and engaged. As more mobile applications push notifications to users, they constantly face decisions on what to send, when and how. A lack of research and methodology commonly leads to heuristic decision making. Many notifications arrive at an inappropriate moment or introduce too m… ▽ More Mobile notifications have become a major communication channel for social networking services to keep users informed and engaged. As more mobile applications push notifications to users, they constantly face decisions on what to send, when and how. A lack of research and methodology commonly leads to heuristic decision making. Many notifications arrive at an inappropriate moment or introduce too many interruptions, failing to provide value to users and spurring users' complaints. In this paper we explore unique features of interactions between mobile notifications and user engagement. We propose a state transition framework to quantitatively evaluate the effectiveness of notifications. Within this framework, we develop a survival model for badging notifications assuming a log-linear structure and a Weibull distribution. Our results show that this model achieves more flexibility for applications and superior prediction accuracy than a logistic regression model. In particular, we provide an online use case on notification delivery time optimization to show how we make better decisions, drive more user engagement, and provide more value to users. △ Less

Submitted 7 July, 2022; originally announced July 2022.

Comments: 9 pages, 7 figures. Published in WSDM 19'

ACM Class: I.2.6

Journal ref: WSDM 2019 Pages 123-131

arXiv:2205.07426 [pdf, other]

Optimal Randomized Approximations for Matrix based Renyi's Entropy

Authors: Yuxin Dong, Tieliang Gong, Shujian Yu, Chen Li

Abstract: The Matrix-based Renyi's entropy enables us to directly measure information quantities from given data without the costly probability density estimation of underlying distributions, thus has been widely adopted in numerous statistical learning and inference tasks. However, exactly calculating this new information quantity requires access to the eigenspectrum of a semi-positive definite (SPD) matri… ▽ More The Matrix-based Renyi's entropy enables us to directly measure information quantities from given data without the costly probability density estimation of underlying distributions, thus has been widely adopted in numerous statistical learning and inference tasks. However, exactly calculating this new information quantity requires access to the eigenspectrum of a semi-positive definite (SPD) matrix $A$ which grows linearly with the number of samples $n$, resulting in a $O(n^3)$ time complexity that is prohibitive for large-scale applications. To address this issue, this paper takes advantage of stochastic trace approximations for matrix-based Renyi's entropy with arbitrary $α\in R^+$ orders, lowering the complexity by converting the entropy approximation to a matrix-vector multiplication problem. Specifically, we develop random approximations for integer order $α$ cases and polynomial series approximations (Taylor and Chebyshev) for non-integer $α$ cases, leading to a $O(n^2sm)$ overall time complexity, where $s,m \ll n$ denote the number of vector queries and the polynomial order respectively. We theoretically establish statistical guarantees for all approximation algorithms and give explicit order of s and m with respect to the approximation error $\varepsilon$, showing optimal convergence rate for both parameters up to a logarithmic factor. Large-scale simulations and real-world applications validate the effectiveness of the developed approximations, demonstrating remarkable speedup with negligible loss in accuracy. △ Less

Submitted 15 May, 2022; originally announced May 2022.

arXiv:2203.05667 [pdf, other]

3D Nanoscale Mapping of Short-Range Order in GeSn Alloys

Authors: Shang Liu, Alejandra Cuervo Covian, Xiaoxin Wang, Cory T. Cline, Austin Akey, Weiling Dong, Shui-Qing Yu, Jifeng Liu

Abstract: GeSn on Si has attracted much research interest due to its tunable direct bandgap for mid-infrared applications. Recently, short-range order (SRO) in GeSn alloys has been theoretically predicted, which profoundly impacts the band structure. However, characterizing SRO in GeSn is challenging. Guided by physics-informed Poisson statistical analyses of Kth-nearest neighbors (KNN) in atom probe tomogr… ▽ More GeSn on Si has attracted much research interest due to its tunable direct bandgap for mid-infrared applications. Recently, short-range order (SRO) in GeSn alloys has been theoretically predicted, which profoundly impacts the band structure. However, characterizing SRO in GeSn is challenging. Guided by physics-informed Poisson statistical analyses of Kth-nearest neighbors (KNN) in atom probe tomography, a new approach is demonstrated here for 3D nanoscale SRO mapping and semi-quantitative strain mapping in GeSn. For GeSn with ~14 at.% Sn, the SRO parameters of Sn-Sn 1NN in 10x10x10 nm$^{3}$ nanocubes can deviate from that of the random alloys by $\pm$15%. The relatively large fluctuation of the SRO parameters contributes to band-edge softening observed optically. Sn-Sn 1NN also tends to be more favored towards the surface, less favored under strain relaxation or tensile strain, while almost independent of local Sn composition. An algorithm based on least square fit of atomic positions further verifies this Poisson-KNN statistical method. Compared to existing macroscopic spectroscopy or electron microscopy techniques, this new APT statistical analysis uniquely offers 3D SRO mapping at nanoscale resolution in a relatively large volume with millions of atoms. It can also be extended to investigate SRO in other alloy systems. △ Less

Submitted 10 March, 2022; originally announced March 2022.

arXiv:2112.13720 [pdf, other]

doi 10.1109/TSP.2022.3233724

Computationally Efficient Approximations for Matrix-based Renyi's Entropy

Authors: Tieliang Gong, Yuxin Dong, Shujian Yu, Bo Dong

Abstract: The recently developed matrix based Renyi's entropy enables measurement of information in data simply using the eigenspectrum of symmetric positive semi definite (PSD) matrices in reproducing kernel Hilbert space, without estimation of the underlying data distribution. This intriguing property makes the new information measurement widely adopted in multiple statistical inference and learning tasks… ▽ More The recently developed matrix based Renyi's entropy enables measurement of information in data simply using the eigenspectrum of symmetric positive semi definite (PSD) matrices in reproducing kernel Hilbert space, without estimation of the underlying data distribution. This intriguing property makes the new information measurement widely adopted in multiple statistical inference and learning tasks. However, the computation of such quantity involves the trace operator on a PSD matrix $G$ to power $α$(i.e., $tr(G^α)$), with a normal complexity of nearly $O(n^3)$, which severely hampers its practical usage when the number of samples (i.e., $n$) is large. In this work, we present computationally efficient approximations to this new entropy functional that can reduce its complexity to even significantly less than $O(n^2)$. To this end, we leverage the recent progress on Randomized Numerical Linear Algebra, developing Taylor, Chebyshev and Lanczos approximations to $tr(G^α)$ for arbitrary values of $α$ by converting it into matrix-vector multiplications problem. We also establish the connection between the matrix-based Renyi's entropy and PSD matrix approximation, which enables exploiting both clustering and block low-rank structure of $G$ to further reduce the computational cost. We theoretically provide approximation accuracy guarantees and illustrate the properties of different approximations. Large-scale experimental evaluations on both synthetic and real-world data corroborate our theoretical findings, showing promising speedup with negligible loss in accuracy. △ Less

Submitted 9 January, 2023; v1 submitted 27 December, 2021; originally announced December 2021.

arXiv:2110.11816 [pdf, other]

Testing network correlation efficiently via counting trees

Authors: Cheng Mao, Yihong Wu, Jiaming Xu, Sophie H. Yu

Abstract: We propose a new procedure for testing whether two networks are edge-correlated through some latent vertex correspondence. The test statistic is based on counting the co-occurrences of signed trees for a family of non-isomorphic trees. When the two networks are Erdős-Rényi random graphs $\mathcal{G}(n,q)$ that are either independent or correlated with correlation coefficient $ρ$, our test runs in… ▽ More We propose a new procedure for testing whether two networks are edge-correlated through some latent vertex correspondence. The test statistic is based on counting the co-occurrences of signed trees for a family of non-isomorphic trees. When the two networks are Erdős-Rényi random graphs $\mathcal{G}(n,q)$ that are either independent or correlated with correlation coefficient $ρ$, our test runs in $n^{2+o(1)}$ time and succeeds with high probability as $n\to\infty$, provided that $n\min\{q,1-q\} \ge n^{-o(1)}$ and $ρ^2>α\approx 0.338$, where $α$ is Otter's constant so that the number of unlabeled trees with $K$ edges grows as $(1/α)^K$. This significantly improves the prior work in terms of statistical accuracy, running time, and graph sparsity. △ Less

Submitted 1 April, 2022; v1 submitted 22 October, 2021; originally announced October 2021.

arXiv:2110.06057 [pdf, other]

Gated Information Bottleneck for Generalization in Sequential Environments

Authors: Francesco Alesiani, Shujian Yu, Xi Yu

Abstract: Deep neural networks suffer from poor generalization to unseen environments when the underlying data distribution is different from that in the training set. By learning minimum sufficient representations from training data, the information bottleneck (IB) approach has demonstrated its effectiveness to improve generalization in different AI applications. In this work, we propose a new neural netwo… ▽ More Deep neural networks suffer from poor generalization to unseen environments when the underlying data distribution is different from that in the training set. By learning minimum sufficient representations from training data, the information bottleneck (IB) approach has demonstrated its effectiveness to improve generalization in different AI applications. In this work, we propose a new neural network-based IB approach, termed gated information bottleneck (GIB), that dynamically drops spurious correlations and progressively selects the most task-relevant features across different environments by a trainable soft mask (on raw features). GIB enjoys a simple and tractable objective, without any variational approximation or distributional assumption. We empirically demonstrate the superiority of GIB over other popular neural network-based IB approaches in adversarial robustness and out-of-distribution (OOD) detection. Meanwhile, we also establish the connection between IB theory and invariant causal representation learning, and observed that GIB demonstrates appealing performance when different environments arrive sequentially, a more practical scenario where invariant risk minimization (IRM) fails. Code of GIB is available at https://github.com/falesiani/GIB △ Less

Submitted 12 October, 2021; originally announced October 2021.

Comments: manuscript accepted by IEEE ICDM-21 (regular papers), code is available at https://github.com/falesiani/GIB

arXiv:2110.05794 [pdf, other]

Information Theoretic Structured Generative Modeling

Authors: Bo Hu, Shujian Yu, Jose C. Principe

Abstract: Rényi's information provides a theoretical foundation for tractable and data-efficient non-parametric density estimation, based on pair-wise evaluations in a reproducing kernel Hilbert space (RKHS). This paper extends this framework to parametric probabilistic modeling, motivated by the fact that Rényi's information can be estimated in closed-form for Gaussian mixtures. Based on this special conne… ▽ More Rényi's information provides a theoretical foundation for tractable and data-efficient non-parametric density estimation, based on pair-wise evaluations in a reproducing kernel Hilbert space (RKHS). This paper extends this framework to parametric probabilistic modeling, motivated by the fact that Rényi's information can be estimated in closed-form for Gaussian mixtures. Based on this special connection, a novel generative model framework called the structured generative model (SGM) is proposed that makes straightforward optimization possible, because costs are scale-invariant, avoiding high gradient variance while imposing less restrictions on absolute continuity, which is a huge advantage in parametric information theoretic optimization. The implementation employs a single neural network driven by an orthonormal input appended to a single white noise source adapted to learn an infinite Gaussian mixture model (IMoG), which provides an empirically tractable model distribution in low dimensions. To train SGM, we provide three novel variational cost functions, based on Rényi's second-order entropy and divergence, to implement minimization of cross-entropy, minimization of variational representations of $f$-divergence, and maximization of the evidence lower bound (conditional probability). We test the framework for estimation of mutual information and compare the results with the mutual information neural estimation (MINE), for density estimation, for conditional probability estimation in Markov models as well as for training adversarial networks. Our preliminary results show that SGM significantly improves MINE estimation in terms of data efficiency and variance, conventional and variational Gaussian mixture models, as well as the performance of generative adversarial networks. △ Less

Submitted 7 March, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

arXiv:2109.04671 [pdf, other]

Interaction Models and Generalized Score Matching for Compositional Data

Authors: Shiqing Yu, Mathias Drton, Ali Shojaie

Abstract: Applications such as the analysis of microbiome data have led to renewed interest in statistical methods for compositional data, i.e., multivariate data in the form of probability vectors that contain relative proportions. In particular, there is considerable interest in modeling interactions among such relative proportions. To this end we propose a class of exponential family models that accommod… ▽ More Applications such as the analysis of microbiome data have led to renewed interest in statistical methods for compositional data, i.e., multivariate data in the form of probability vectors that contain relative proportions. In particular, there is considerable interest in modeling interactions among such relative proportions. To this end we propose a class of exponential family models that accommodate general patterns of pairwise interaction while being supported on the probability simplex. Special cases include the family of Dirichlet distributions as well as Aitchison's additive logistic normal distributions. Generally, the distributions we consider have a density that features a difficult to compute normalizing constant. To circumvent this issue, we design effective estimation methods based on generalized versions of score matching. A high-dimensional analysis of our estimation methods shows that the simplex domain is handled as efficiently as previously studied full-dimensional domains. △ Less

Submitted 10 September, 2021; originally announced September 2021.

Comments: 41 pages, 19 figures

arXiv:2108.03531 [pdf, other]

Learning to Transfer with von Neumann Conditional Divergence

Authors: Ammar Shaker, Shujian Yu, Daniel Oñoro-Rubio

Abstract: The similarity of feature representations plays a pivotal role in the success of problems related to domain adaptation. Feature similarity includes both the invariance of marginal distributions and the closeness of conditional distributions given the desired response $y$ (e.g., class labels). Unfortunately, traditional methods always learn such features without fully taking into consideration the… ▽ More The similarity of feature representations plays a pivotal role in the success of problems related to domain adaptation. Feature similarity includes both the invariance of marginal distributions and the closeness of conditional distributions given the desired response $y$ (e.g., class labels). Unfortunately, traditional methods always learn such features without fully taking into consideration the information in $y$, which in turn may lead to a mismatch of the conditional distributions or the mix-up of discriminative structures underlying data distributions. In this work, we introduce the recently proposed von Neumann conditional divergence to improve the transferability across multiple domains. We show that this new divergence is differentiable and eligible to easily quantify the functional dependence between features and $y$. Given multiple source tasks, we integrate this divergence to capture discriminative information in $y$ and design novel learning objectives assuming those source tasks are observed either simultaneously or sequentially. In both scenarios, we obtain favorable performance against state-of-the-art methods in terms of smaller generalization error on new tasks and less catastrophic forgetting on source tasks (in the sequential setup). △ Less

Submitted 6 January, 2022; v1 submitted 7 August, 2021; originally announced August 2021.

Comments: Accepted at AAAI2022

arXiv:2108.03336 [pdf, other]

Estimating Graph Dimension with Cross-validated Eigenvalues

Authors: Fan Chen, Sebastien Roch, Karl Rohe, Shuqi Yu

Abstract: In applied multivariate statistics, estimating the number of latent dimensions or the number of clusters is a fundamental and recurring problem. One common diagnostic is the scree plot, which shows the largest eigenvalues of the data matrix; the user searches for a "gap" or "elbow" in the decreasing eigenvalues; unfortunately, these patterns can hide beneath the bias of the sample eigenvalues. Thi… ▽ More In applied multivariate statistics, estimating the number of latent dimensions or the number of clusters is a fundamental and recurring problem. One common diagnostic is the scree plot, which shows the largest eigenvalues of the data matrix; the user searches for a "gap" or "elbow" in the decreasing eigenvalues; unfortunately, these patterns can hide beneath the bias of the sample eigenvalues. This methodological problem is conceptually difficult because, in many situations, there is only enough signal to detect a subset of the $k$ population dimensions/eigenvectors. In this situation, one could argue that the correct choice of $k$ is the number of detectable dimensions. We alleviate these problems with cross-validated eigenvalues. Under a large class of random graph models, without any parametric assumptions, we provide a p-value for each sample eigenvector. It tests the null hypothesis that this sample eigenvector is orthogonal to (i.e., uncorrelated with) the true latent dimensions. This approach naturally adapts to problems where some dimensions are not statistically detectable. In scenarios where all $k$ dimensions can be estimated, we prove that our procedure consistently estimates $k$. In simulations and a data example, the proposed estimator compares favorably to alternative approaches in both computational and statistical performance. △ Less

Submitted 6 August, 2021; originally announced August 2021.

Comments: 53 pages, 12 figures

arXiv:2108.00819 [pdf, other]

Active Learning in Gaussian Process State Space Model

Authors: Hon Sum Alec Yu, Dingling Yao, Christoph Zimmer, Marc Toussaint, Duy Nguyen-Tuong

Abstract: We investigate active learning in Gaussian Process state-space models (GPSSM). Our problem is to actively steer the system through latent states by determining its inputs such that the underlying dynamics can be optimally learned by a GPSSM. In order that the most informative inputs are selected, we employ mutual information as our active learning criterion. In particular, we present two approache… ▽ More We investigate active learning in Gaussian Process state-space models (GPSSM). Our problem is to actively steer the system through latent states by determining its inputs such that the underlying dynamics can be optimally learned by a GPSSM. In order that the most informative inputs are selected, we employ mutual information as our active learning criterion. In particular, we present two approaches for the approximation of mutual information for the GPSSM given latent states. The proposed approaches are evaluated in several physical systems where we actively learn the underlying non-linear dynamics represented by the state-space model. △ Less

Submitted 30 July, 2021; originally announced August 2021.

Comments: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) 2021

arXiv:2106.04255 [pdf, other]

Nonparametric Regression for 3D Point Cloud Learning

Authors: Xinyi Li, Shan Yu, Yueying Wang, Guannan Wang, Ming-Jun Lai, Li Wang

Abstract: Over the past two decades, we have seen an exponentially increased amount of point clouds collected with irregular shapes in various areas. Motivated by the importance of solid modeling for point clouds, we develop a novel and efficient smoothing tool based on multivariate splines over the tetrahedral partitions to extract the underlying signal and build up a 3D solid model from the point cloud. T… ▽ More Over the past two decades, we have seen an exponentially increased amount of point clouds collected with irregular shapes in various areas. Motivated by the importance of solid modeling for point clouds, we develop a novel and efficient smoothing tool based on multivariate splines over the tetrahedral partitions to extract the underlying signal and build up a 3D solid model from the point cloud. The proposed smoothing method can denoise or deblur the point cloud effectively and provide a multi-resolution reconstruction of the actual signal. In addition, it can handle sparse and irregularly distributed point clouds and recover the underlying trajectory. The proposed smoothing and interpolation method also provides a natural way of numerosity data reduction. Furthermore, we establish the theoretical guarantees of the proposed method. Specifically, we derive the convergence rate and asymptotic normality of the proposed estimator and illustrate that the convergence rate achieves the optimal nonparametric convergence rate. Through extensive simulation studies and a real data example, we demonstrate the superiority of the proposed method over traditional smoothing methods in terms of estimation accuracy and efficiency of data reduction. △ Less

Submitted 18 February, 2023; v1 submitted 8 June, 2021; originally announced June 2021.

Comments: 64 pages, 16 figures

MSC Class: 62G05; 62G08

arXiv:2106.01431 [pdf, other]

doi 10.5705/ss.202019.0188

Multivariate Spline Estimation and Inference for Image-On-Scalar Regression

Authors: Shan Yu, Guannan Wang, Li Wang, Lijian Yang

Abstract: Motivated by recent data analyses in biomedical imaging studies, we consider a class of image-on-scalar regression models for imaging responses and scalar predictors. We propose using flexible multivariate splines over triangulations to handle the irregular domain of the objects of interest on the images, as well as other characteristics of images. The proposed estimators of the coefficient functi… ▽ More Motivated by recent data analyses in biomedical imaging studies, we consider a class of image-on-scalar regression models for imaging responses and scalar predictors. We propose using flexible multivariate splines over triangulations to handle the irregular domain of the objects of interest on the images, as well as other characteristics of images. The proposed estimators of the coefficient functions are proved to be root-n consistent and asymptotically normal under some regularity conditions. We also provide a consistent and computationally efficient estimator of the covariance function. Asymptotic pointwise confidence intervals and data-driven simultaneous confidence corridors for the coefficient functions are constructed. Our method can simultaneously estimate and make inferences on the coefficient functions while incorporating spatial heterogeneity and spatial correlation. A highly efficient and scalable estimation algorithm is developed. Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed method, which is then applied to the spatially normalized positron emission tomography data of the Alzheimer's Disease Neuroimaging Initiative. △ Less

Submitted 2 June, 2021; originally announced June 2021.

arXiv:2105.03058 [pdf, other]

Error-Robust Multi-View Clustering: Progress, Challenges and Opportunities

Authors: Mehrnaz Najafi, Lifang He, Philip S. Yu

Abstract: With recent advances in data collection from multiple sources, multi-view data has received significant attention. In multi-view data, each view represents a different perspective of data. Since label information is often expensive to acquire, multi-view clustering has gained growing interest, which aims to obtain better clustering solution by exploiting complementary and consistent information ac… ▽ More With recent advances in data collection from multiple sources, multi-view data has received significant attention. In multi-view data, each view represents a different perspective of data. Since label information is often expensive to acquire, multi-view clustering has gained growing interest, which aims to obtain better clustering solution by exploiting complementary and consistent information across all views rather than only using an individual view. Due to inevitable sensor failures, data in each view may contain error. Error often exhibits as noise or feature-specific corruptions or outliers. Multi-view data may contain any or combination of these error types. Blindly clustering multi-view data i.e., without considering possible error in view(s) could significantly degrade the performance. The goal of error-robust multi-view clustering is to obtain useful outcome even if the multi-view data is corrupted. Existing error-robust multi-view clustering approaches with explicit error removal formulation can be structured into five broad research categories - sparsity norm based approaches, graph based methods, subspace based learning approaches, deep learning based methods and hybrid approaches, this survey summarizes and reviews recent advances in error-robust clustering for multi-view data. Finally, we highlight the challenges and provide future research opportunities. △ Less

Submitted 7 May, 2021; originally announced May 2021.

arXiv:2104.07295 [pdf, other]

Variational Co-embedding Learning for Attributed Network Clustering

Authors: Shuiqiao Yang, Sunny Verma, Borui Cai, Jiaojiao Jiang, Kun Yu, Fang Chen, Shui Yu

Abstract: Recent works for attributed network clustering utilize graph convolution to obtain node embeddings and simultaneously perform clustering assignments on the embedding space. It is effective since graph convolution combines the structural and attributive information for node embedding learning. However, a major limitation of such works is that the graph convolution only incorporates the attribute in… ▽ More Recent works for attributed network clustering utilize graph convolution to obtain node embeddings and simultaneously perform clustering assignments on the embedding space. It is effective since graph convolution combines the structural and attributive information for node embedding learning. However, a major limitation of such works is that the graph convolution only incorporates the attribute information from the local neighborhood of nodes but fails to exploit the mutual affinities between nodes and attributes. In this regard, we propose a variational co-embedding learning model for attributed network clustering (VCLANC). VCLANC is composed of dual variational auto-encoders to simultaneously embed nodes and attributes. Relying on this, the mutual affinity information between nodes and attributes could be reconstructed from the embedding space and served as extra self-supervised knowledge for representation learning. At the same time, trainable Gaussian mixture model is used as priors to infer the node clustering assignments. To strengthen the performance of the inferred clusters, we use a mutual distance loss on the centers of the Gaussian priors and a clustering assignment hardening loss on the node embeddings. Experimental results on four real-world attributed network datasets demonstrate the effectiveness of the proposed VCLANC for attributed network clustering. △ Less

Submitted 15 April, 2021; originally announced April 2021.

Comments: This manuscript is under review

arXiv:2102.00082 [pdf, ps, other]

Settling the Sharp Reconstruction Thresholds of Random Graph Matching

Authors: Yihong Wu, Jiaming Xu, Sophie H. Yu

Abstract: This paper studies the problem of recovering the hidden vertex correspondence between two edge-correlated random graphs. We focus on the Gaussian model where the two graphs are complete graphs with correlated Gaussian weights and the Erdős-Rényi model where the two graphs are subsampled from a common parent Erdős-Rényi graph $\mathcal{G}(n,p)$. For dense graphs with $p=n^{-o(1)}$, we prove that th… ▽ More This paper studies the problem of recovering the hidden vertex correspondence between two edge-correlated random graphs. We focus on the Gaussian model where the two graphs are complete graphs with correlated Gaussian weights and the Erdős-Rényi model where the two graphs are subsampled from a common parent Erdős-Rényi graph $\mathcal{G}(n,p)$. For dense graphs with $p=n^{-o(1)}$, we prove that there exists a sharp threshold, above which one can correctly match all but a vanishing fraction of vertices and below which correctly matching any positive fraction is impossible, a phenomenon known as the "all-or-nothing" phase transition. Even more strikingly, in the Gaussian setting, above the threshold all vertices can be exactly matched with high probability. In contrast, for sparse Erdős-Rényi graphs with $p=n^{-Θ(1)}$, we show that the all-or-nothing phenomenon no longer holds and we determine the thresholds up to a constant factor. Along the way, we also derive the sharp threshold for exact recovery, sharpening the existing results in Erdős-Rényi graphs. The proof of the negative results builds upon a tight characterization of the mutual information based on the truncated second-moment computation and an "area theorem" that relates the mutual information to the integral of the reconstruction error. The positive results follows from a tight analysis of the maximum likelihood estimator that takes into account the cycle structure of the induced permutation on the edges. △ Less

Submitted 16 February, 2022; v1 submitted 29 January, 2021; originally announced February 2021.

MSC Class: 94A15; 62B10; 68Q87; 05C80; 05C60

arXiv:2101.10160 [pdf, other]

Measuring Dependence with Matrix-based Entropy Functional

Authors: Shujian Yu, Francesco Alesiani, Xi Yu, Robert Jenssen, Jose C. Principe

Abstract: Measuring the dependence of data plays a central role in statistics and machine learning. In this work, we summarize and generalize the main idea of existing information-theoretic dependence measures into a higher-level perspective by the Shearer's inequality. Based on our generalization, we then propose two measures, namely the matrix-based normalized total correlation ($T_α^*$) and the matrix-ba… ▽ More Measuring the dependence of data plays a central role in statistics and machine learning. In this work, we summarize and generalize the main idea of existing information-theoretic dependence measures into a higher-level perspective by the Shearer's inequality. Based on our generalization, we then propose two measures, namely the matrix-based normalized total correlation ($T_α^*$) and the matrix-based normalized dual total correlation ($D_α^*$), to quantify the dependence of multiple variables in arbitrary dimensional space, without explicit estimation of the underlying data distributions. We show that our measures are differentiable and statistically more powerful than prevalent ones. We also show the impact of our measures in four different machine learning problems, namely the gene regulatory network inference, the robust machine learning under covariate shift and non-Gaussian noises, the subspace outlier detection, and the understanding of the learning dynamics of convolutional neural networks (CNNs), to demonstrate their utilities, advantages, as well as implications to those problems. Code of our dependence measure is available at: https://bit.ly/AAAI-dependence △ Less

Submitted 25 January, 2021; originally announced January 2021.

Comments: Accepted at AAAI-21. An interpretable and differentiable dependence (or independence) measure that can be used to 1) train deep network under covariate shift and non-Gaussian noise; 2) implement a deep deterministic information bottleneck; and 3) understand the dynamics of learning of CNN. Code available at https://bit.ly/AAAI-dependence

arXiv:2011.01272

Modular-Relatedness for Continual Learning

Authors: Ammar Shaker, Shujian Yu, Francesco Alesiani

Abstract: In this paper, we propose a continual learning (CL) technique that is beneficial to sequential task learners by improving their retained accuracy and reducing catastrophic forgetting. The principal target of our approach is the automatic extraction of modular parts of the neural network and then estimating the relatedness between the tasks given these modular components. This technique is applicab… ▽ More In this paper, we propose a continual learning (CL) technique that is beneficial to sequential task learners by improving their retained accuracy and reducing catastrophic forgetting. The principal target of our approach is the automatic extraction of modular parts of the neural network and then estimating the relatedness between the tasks given these modular components. This technique is applicable to different families of CL methods such as regularization-based (e.g., the Elastic Weight Consolidation) or the rehearsal-based (e.g., the Gradient Episodic Memory) approaches where episodic memory is needed. Empirical results demonstrate remarkable performance gain (in terms of robustness to forgetting) for methods such as EWC and GEM based on our technique, especially when the memory budget is very limited. △ Less

Submitted 17 January, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

Comments: We realized one conclusion in the submission is erroneous and disconnected from the results shown in one theorem is. We decide to withdraw the current version to avoid misleading conclusion

arXiv:2009.12040 [pdf, other]

doi 10.1109/TKDE.2020.3002567

Fairness in Semi-supervised Learning: Unlabeled Data Help to Reduce Discrimination

Authors: Tao Zhang, Tianqing Zhu, Jing Li, Mengde Han, Wanlei Zhou, Philip S. Yu

Abstract: A growing specter in the rise of machine learning is whether the decisions made by machine learning models are fair. While research is already underway to formalize a machine-learning concept of fairness and to design frameworks for building fair models with sacrifice in accuracy, most are geared toward either supervised or unsupervised learning. Yet two observations inspired us to wonder whether… ▽ More A growing specter in the rise of machine learning is whether the decisions made by machine learning models are fair. While research is already underway to formalize a machine-learning concept of fairness and to design frameworks for building fair models with sacrifice in accuracy, most are geared toward either supervised or unsupervised learning. Yet two observations inspired us to wonder whether semi-supervised learning might be useful to solve discrimination problems. First, previous study showed that increasing the size of the training set may lead to a better trade-off between fairness and accuracy. Second, the most powerful models today require an enormous of data to train which, in practical terms, is likely possible from a combination of labeled and unlabeled data. Hence, in this paper, we present a framework of fair semi-supervised learning in the pre-processing phase, including pseudo labeling to predict labels for unlabeled data, a re-sampling method to obtain multiple fair datasets and lastly, ensemble learning to improve accuracy and decrease discrimination. A theoretical decomposition analysis of bias, variance and noise highlights the different sources of discrimination and the impact they have on fairness in semi-supervised learning. A set of experiments on real-world and synthetic datasets show that our method is able to use unlabeled data to achieve a better trade-off between accuracy and discrimination. △ Less

Submitted 25 September, 2020; originally announced September 2020.

Comments: This paper has been published in IEEE Transactions on Knowledge and Data Engineering

arXiv:2009.11428 [pdf, other]

Generalized Score Matching for General Domains

Authors: Shiqing Yu, Mathias Drton, Ali Shojaie

Abstract: Estimation of density functions supported on general domains arises when the data is naturally restricted to a proper subset of the real space. This problem is complicated by typically intractable normalizing constants. Score matching provides a powerful tool for estimating densities with such intractable normalizing constants, but as originally proposed is limited to densities on $\mathbb{R}^m$ a… ▽ More Estimation of density functions supported on general domains arises when the data is naturally restricted to a proper subset of the real space. This problem is complicated by typically intractable normalizing constants. Score matching provides a powerful tool for estimating densities with such intractable normalizing constants, but as originally proposed is limited to densities on $\mathbb{R}^m$ and $\mathbb{R}_+^m$. In this paper, we offer a natural generalization of score matching that accommodates densities supported on a very general class of domains. We apply the framework to truncated graphical and pairwise interaction models, and provide theoretical guarantees for the resulting estimators. We also generalize a recently proposed method from bounded to unbounded domains, and empirically demonstrate the advantages of our method. △ Less

Submitted 23 September, 2020; originally announced September 2020.

Comments: 50 pages, 14 figures

arXiv:2009.06190 [pdf, other]

Fairness Constraints in Semi-supervised Learning

Authors: Tao Zhang, Tianqing Zhu, Mengde Han, Jing Li, Wanlei Zhou, Philip S. Yu

Abstract: Fairness in machine learning has received considerable attention. However, most studies on fair learning focus on either supervised learning or unsupervised learning. Very few consider semi-supervised settings. Yet, in reality, most machine learning tasks rely on large datasets that contain both labeled and unlabeled data. One of key issues with fair learning is the balance between fairness and ac… ▽ More Fairness in machine learning has received considerable attention. However, most studies on fair learning focus on either supervised learning or unsupervised learning. Very few consider semi-supervised settings. Yet, in reality, most machine learning tasks rely on large datasets that contain both labeled and unlabeled data. One of key issues with fair learning is the balance between fairness and accuracy. Previous studies arguing that increasing the size of the training set can have a better trade-off. We believe that increasing the training set with unlabeled data may achieve the similar result. Hence, we develop a framework for fair semi-supervised learning, which is formulated as an optimization problem. This includes classifier loss to optimize accuracy, label propagation loss to optimize unlabled data prediction, and fairness constraints over labeled and unlabeled data to optimize the fairness level. The framework is conducted in logistic regression and support vector machines under the fairness metrics of disparate impact and disparate mistreatment. We theoretically analyze the source of discrimination in semi-supervised learning via bias, variance and noise decomposition. Extensive experiments show that our method is able to achieve fair semi-supervised learning, and reach a better trade-off between accuracy and fairness than fair supervised learning. △ Less

Submitted 14 September, 2020; originally announced September 2020.

arXiv:2009.05618 [pdf, other]

Learning an Interpretable Graph Structure in Multi-Task Learning

Authors: Shujian Yu, Francesco Alesiani, Ammar Shaker, Wenzhe Yin

Abstract: We present a novel methodology to jointly perform multi-task learning and infer intrinsic relationship among tasks by an interpretable and sparse graph. Unlike existing multi-task learning methodologies, the graph structure is not assumed to be known a priori or estimated separately in a preprocessing step. Instead, our graph is learned simultaneously with model parameters of each task, thus it re… ▽ More We present a novel methodology to jointly perform multi-task learning and infer intrinsic relationship among tasks by an interpretable and sparse graph. Unlike existing multi-task learning methodologies, the graph structure is not assumed to be known a priori or estimated separately in a preprocessing step. Instead, our graph is learned simultaneously with model parameters of each task, thus it reflects the critical relationship among tasks in the specific prediction problem. We characterize graph structure with its weighted adjacency matrix and show that the overall objective can be optimized alternatively until convergence. We also show that our methodology can be simply extended to a nonlinear form by being embedded into a multi-head radial basis function network (RBFN). Extensive experiments, against six state-of-the-art methodologies, on both synthetic data and real-world applications suggest that our methodology is able to reduce generalization error, and, at the same time, reveal a sparse graph over tasks that is much easier to interpret. △ Less

Submitted 11 September, 2020; originally announced September 2020.

Comments: 11 pages, 7 figures

arXiv:2009.05483 [pdf, other]

Towards Interpretable Multi-Task Learning Using Bilevel Programming

Authors: Francesco Alesiani, Shujian Yu, Ammar Shaker, Wenzhe Yin

Abstract: Interpretable Multi-Task Learning can be expressed as learning a sparse graph of the task relationship based on the prediction performance of the learned models. Since many natural phenomenon exhibit sparse structures, enforcing sparsity on learned models reveals the underlying task relationship. Moreover, different sparsification degrees from a fully connected graph uncover various types of struc… ▽ More Interpretable Multi-Task Learning can be expressed as learning a sparse graph of the task relationship based on the prediction performance of the learned models. Since many natural phenomenon exhibit sparse structures, enforcing sparsity on learned models reveals the underlying task relationship. Moreover, different sparsification degrees from a fully connected graph uncover various types of structures, like cliques, trees, lines, clusters or fully disconnected graphs. In this paper, we propose a bilevel formulation of multi-task learning that induces sparse graphs, thus, revealing the underlying task relationships, and an efficient method for its computation. We show empirically how the induced sparse graph improves the interpretability of the learned models and their relationship on synthetic and real data, without sacrificing generalization performance. Code at https://bit.ly/GraphGuidedMTL △ Less

Submitted 11 September, 2020; originally announced September 2020.

Comments: Manuscript accepted at ECML PKDD 2020

arXiv:2009.03506 [pdf]

High-throughput relation extraction algorithm development associating knowledge articles and electronic health records

Authors: Yucong Lin, Keming Lu, Yulin Chen, Chuan Hong, Sheng Yu

Abstract: Objective: Medical relations are the core components of medical knowledge graphs that are needed for healthcare artificial intelligence. However, the requirement of expert annotation by conventional algorithm development processes creates a major bottleneck for mining new relations. In this paper, we present Hi-RES, a framework for high-throughput relation extraction algorithm development. We also… ▽ More Objective: Medical relations are the core components of medical knowledge graphs that are needed for healthcare artificial intelligence. However, the requirement of expert annotation by conventional algorithm development processes creates a major bottleneck for mining new relations. In this paper, we present Hi-RES, a framework for high-throughput relation extraction algorithm development. We also show that combining knowledge articles with electronic health records (EHRs) significantly increases the classification accuracy. Methods: We use relation triplets obtained from structured databases and semistructured webpages to label sentences from target corpora as positive training samples. Two methods are also provided for creating improved negative samples by combining positive samples with naïve negative samples. We propose a common model that summarizes sentence information using large-scale pretrained language models and multi-instance attention, which then joins with the concept embeddings trained from the EHRs for relation prediction. Results: We apply the Hi-RES framework to develop classification algorithms for disorder-disorder relations and disorder-location relations. Millions of sentences are created as training data. Using pretrained language models and EHR-based embeddings individually provides considerable accuracy increases over those of previous models. Joining them together further tremendously increases the accuracy to 0.947 and 0.998 for the two sets of relations, respectively, which are 10-17 percentage points higher than those of previous models. Conclusion: Hi-RES is an efficient framework for achieving high-throughput and accurate relation extraction algorithm development. △ Less

Submitted 7 September, 2020; originally announced September 2020.

Showing 1–50 of 144 results for author: Yu, S