-
EcoSphere: A Decision-Support Tool for Automated Carbon Emission and Cost Optimization in Sustainable Urban Development
Authors:
Siavash Ghorbany,
Ming Hu,
Siyuan Yao,
Matthew Sisk,
Chaoli Wang
Abstract:
The construction industry is a major contributor to global greenhouse gas emissions, with embodied carbon being a key component. This study develops EcoSphere, an innovative software tool designed to evaluate and balance embodied and operational carbon emissions with construction and environmental costs in urban planning. Using high-resolution data from the National Structure Inventory, combined with computer vision and natural language processing applied to Google Street View and satellite imagery, EcoSphere categorizes buildings by structural and material characteristics with a bottom-up approach, creating a baseline emissions dataset. By simulating policy scenarios and mitigation strategies, EcoSphere provides policymakers and non-experts with actionable insights for sustainable development in cities and with a view of the environmental and financial results of their decisions. Case studies in Chicago and Indianapolis showcase how EcoSphere aids in assessing policy impacts on carbon emissions and costs, supporting data-driven progress toward carbon neutrality.
Submitted 13 May, 2025;
originally announced May 2025.
-
Aioli: A Unified Optimization Framework for Language Model Data Mixing
Authors:
Mayee F. Chen,
Michael Y. Hu,
Nicholas Lourie,
Kyunghyun Cho,
Christopher Ré
Abstract:
Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity. To understand this inconsistency, we unify existing methods into a standard framework, showing they are equivalent to solving a common optimization problem: minimize average loss subject to a method-specific mixing law -- an implicit assumption on the relationship between loss and mixture proportions. This framework suggests that measuring the fidelity of a method's mixing law can offer insights into its performance. Empirically, we find that existing methods set their mixing law parameters inaccurately, resulting in the inconsistent mixing performance we observe. Using this insight, we derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions. Aioli outperforms stratified sampling on 6 out of 6 datasets by an average of 0.27 test perplexity points, whereas existing methods fail to consistently beat stratified sampling, doing up to 6.9 points worse. Moreover, in a practical setting where proportions are learned on shorter runs due to computational constraints, Aioli can dynamically adjust these proportions over the full training run, consistently improving performance over existing methods by up to 12.012 test perplexity points.
Submitted 20 April, 2025; v1 submitted 8 November, 2024;
originally announced November 2024.
-
Pruning the Path to Optimal Care: Identifying Systematically Suboptimal Medical Decision-Making with Inverse Reinforcement Learning
Authors:
Inko Bovenzi,
Adi Carmel,
Michael Hu,
Rebecca M. Hurwitz,
Fiona McBride,
Leo Benac,
José Roberto Tello Ayala,
Finale Doshi-Velez
Abstract:
Aiming to uncover insights into medical decision-making embedded within observational data from clinical settings, we present a novel application of Inverse Reinforcement Learning (IRL) that identifies suboptimal clinician actions based on the actions of their peers. This approach centers on two stages of IRL with an intermediate step to prune trajectories displaying behavior that deviates significantly from the consensus. This enables us to effectively identify clinical priorities and values from ICU data containing both optimal and suboptimal clinician decisions. We observe that the benefits of removing suboptimal actions vary by disease and differentially impact certain demographic groups.
Submitted 7 November, 2024;
originally announced November 2024.
-
Privacy enhanced collaborative inference in the Cox proportional hazards model for distributed data
Authors:
Mengtong Hu,
Xu Shi,
Peter X. -K. Song
Abstract:
Data sharing barriers are paramount challenges arising from multicenter clinical studies where multiple data sources are stored in a distributed fashion at different local study sites. Particularly in the case of time-to-event analysis when global risk sets are needed for the Cox proportional hazards model, access to a centralized database is typically necessary. Merging such data sources into a common data storage for a centralized statistical analysis requires a data use agreement, which is often time-consuming. Furthermore, the construction and distribution of risk sets to participating clinical centers for subsequent calculations may pose a risk of revealing individual-level information. We propose a new collaborative Cox model that eliminates the need for accessing the centralized database and constructing global risk sets but needs only the sharing of summary statistics with significantly smaller dimensions than risk sets. Thus, the proposed collaborative inference enjoys maximal protection of data privacy. We show theoretically and numerically that the new distributed proportional hazards model approach has little loss of statistical power when compared to the centralized method that requires merging the entire data. We present a renewable sieve method to establish large-sample properties for the proposed method. We illustrate its performance through simulation experiments and a real-world data example from patients with kidney transplantation in the Organ Procurement and Transplantation Network (OPTN) to understand the factors associated with the 5-year death-censored graft failure (DCGF) for patients who underwent kidney transplants in the US.
Submitted 7 September, 2024;
originally announced September 2024.
-
Causal Inference with Latent Variables: Recent Advances and Future Prospectives
Authors:
Yaochen Zhu,
Yinhan He,
Jing Ma,
Mengxuan Hu,
Sheng Li,
Jundong Li
Abstract:
Causality lays the foundation for the trajectory of our world. Causal inference (CI), which aims to infer intrinsic causal relations among variables of interest, has emerged as a crucial research topic. Nevertheless, the lack of observation of important variables (e.g., confounders, mediators, exogenous variables, etc.) severely compromises the reliability of CI methods. The issue may arise from the inherent difficulty in measuring the variables. Additionally, in observational studies where variables are passively recorded, certain covariates might be inadvertently omitted by the experimenter. Depending on the type of unobserved variables and the specific CI task, various consequences can be incurred if these latent variables are carelessly handled, such as biased estimation of causal effects, incomplete understanding of causal mechanisms, lack of individual-level causal consideration, etc. In this survey, we provide a comprehensive review of recent developments in CI with latent variables. We start by discussing traditional CI techniques when variables of interest are assumed to be fully observed. Afterward, under the taxonomy of circumvention and inference-based methods, we provide an in-depth discussion of various CI strategies to handle latent variables, covering the tasks of causal effect estimation, mediation analysis, counterfactual reasoning, and causal discovery. Furthermore, we generalize the discussion to graph data where interference among units may exist. Finally, we offer fresh aspects for further advancement of CI with latent variables, especially new opportunities in the era of large language models (LLMs).
Submitted 19 June, 2024;
originally announced June 2024.
-
Identification of Causal Relationship between Amyloid-beta Accumulation and Alzheimer's Disease Progression via Counterfactual Inference
Authors:
Haixing Dai,
Mengxuan Hu,
Qing Li,
Lu Zhang,
Lin Zhao,
Dajiang Zhu,
Ibai Diez,
Jorge Sepulcre,
Fan Zhang,
Xingyu Gao,
Manhua Liu,
Quanzheng Li,
Sheng Li,
Tianming Liu,
Xiang Li
Abstract:
Alzheimer's disease (AD) is a neurodegenerative disorder that begins with amyloidosis, followed by neuronal loss and deterioration in structure, function, and cognition. The accumulation of amyloid-beta in the brain, measured through 18F-florbetapir (AV45) positron emission tomography (PET) imaging, has been widely used for early diagnosis of AD. However, the relationship between amyloid-beta accumulation and AD pathophysiology remains unclear, and causal inference approaches are needed to uncover how amyloid-beta levels can impact AD development. In this paper, we propose a graph varying coefficient neural network (GVCNet) for estimating the individual treatment effect with continuous treatment levels using a graph convolutional neural network. We highlight the potential of causal inference approaches, including GVCNet, for measuring the regional causal connections between amyloid-beta accumulation and AD pathophysiology, which may serve as a robust tool for early diagnosis and tailored care.
Submitted 3 July, 2023;
originally announced July 2023.
-
Futures Quantitative Investment with Heterogeneous Continual Graph Neural Network
Authors:
Min Hu,
Zhizhong Tan,
Bin Liu,
Guosheng Yin
Abstract:
This study aims to address the challenges of futures price prediction in high-frequency trading (HFT) by proposing a continuous learning factor predictor based on graph neural networks. The model integrates multi-factor pricing theories with real-time market dynamics, effectively bypassing the limitations of existing methods that lack financial theory guidance and ignore various trend signals and their interactions. We propose three heterogeneous tasks, namely price moving average regression, price gap regression, and change-point detection, to trace the short-, intermediate-, and long-term trend factors present in the data. In addition, this study considers the cross-sectional correlation characteristics of futures contracts, where prices of different futures often show strong dynamic correlations. Each variable (futures contract) depends not only on its historical values (temporal) but also on the observation of other variables (cross-sectional). To capture these dynamic relationships more accurately, we resort to the spatio-temporal graph neural network (STGNN) to enhance the predictive power of the model. The model employs a continuous learning strategy to simultaneously consider these tasks (factors). Additionally, due to the heterogeneity of the tasks, we propose to calculate parameter importance with mutual information between original observations and the extracted features to mitigate the catastrophic forgetting (CF) problem. Empirical tests on 49 commodity futures in China's futures market demonstrate that the proposed model outperforms other state-of-the-art models in terms of prediction accuracy. Not only does this research promote the integration of financial theory and deep learning, but it also provides a scientific basis for actual trading decisions.
Submitted 19 December, 2023; v1 submitted 29 March, 2023;
originally announced March 2023.
-
Statistics for Spatially Stratified Heterogeneous Data
Authors:
Jinfeng Wang,
Robert Haining,
Tonglin Zhang,
Chengdong Xu,
Maogui Hu
Abstract:
Spatial statistics is dominated by spatial autocorrelation (SAC) based Kriging and BHM, and by spatial local heterogeneity based hotspots and geographical regression methods, regarded as the first and second laws of Geography (Tobler 1970; Goodchild 2004), respectively. Spatial stratified heterogeneity (SSH), the phenomenon in which observations within strata are more similar than those between strata (for example, climate zones, land-use classes, and remote sensing classifications), is prevalent in geography and has been understood since ancient Greece, yet it is surprisingly neglected in spatial statistics, probably due to the existence of hundreds of classification algorithms. In this article, we go beyond the classifications and show that SSH is a source of sample bias, statistic bias, modelling confounding, and misleading CI, and we recommend robust solutions to overcome these negative effects. We also elaborate four benefits of SSH: an identical PDF within a stratum, equivalent to random sampling in the stratum; the spatial pattern in strata; the borders between strata as specific information for nonlinear causation; and general interaction by overlaying two spatial patterns. We develop the equation of SSH and discuss its context. This comprehensive investigation formulates the statistics for SSH, presenting a new principle and toolbox in spatial statistics.
Submitted 30 November, 2022;
originally announced November 2022.
-
Algorithmic Decision-Making Safeguarded by Human Knowledge
Authors:
Ningyuan Chen,
Ming Hu,
Wenhao Li
Abstract:
Commercial AI solutions provide analysts and managers with data-driven business intelligence for a wide range of decisions, such as demand forecasting and pricing. However, human analysts may have their own insights and experiences about the decision-making that are at odds with the algorithmic recommendation. In view of such a conflict, we provide a general analytical framework to study the augmentation of algorithmic decisions with human knowledge: the analyst uses the knowledge to set a guardrail by which the algorithmic decision is clipped if the algorithmic output is out of bounds and seems unreasonable. We study the conditions under which the augmentation is beneficial relative to the raw algorithmic decision. We show that when the algorithmic decision is asymptotically optimal with large data, the non-data-driven human guardrail usually provides no benefit. However, we point out three common pitfalls of the algorithmic decision: (1) lack of domain knowledge, such as the market competition, (2) model misspecification, and (3) data contamination. In these cases, even with sufficient data, the augmentation from human knowledge can still improve the performance of the algorithmic decision.
Submitted 20 November, 2022;
originally announced November 2022.
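The guardrail mentioned in this abstract, where an algorithmic decision is clipped when it falls outside analyst-set bounds, can be sketched as below. This is only an illustrative reading of the abstract: the function name `apply_guardrail`, the scalar decision, and the fixed `lower`/`upper` bounds are assumptions, not the authors' formulation.

```python
def apply_guardrail(algorithmic_decision: float, lower: float, upper: float) -> float:
    """Clip an algorithmic decision to the analyst's [lower, upper] guardrail.

    Hypothetical helper: the bounds stand in for the analyst's
    non-data-driven knowledge of what counts as a reasonable decision.
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    # Decisions inside the guardrail pass through unchanged;
    # out-of-bound decisions are moved to the nearest bound.
    return min(max(algorithmic_decision, lower), upper)


# Example: an algorithm recommends a price of 12.0, but the analyst
# believes reasonable prices lie between 5.0 and 10.0.
clipped_price = apply_guardrail(12.0, lower=5.0, upper=10.0)
```

Whether such clipping helps depends, per the abstract, on whether the raw algorithmic decision suffers from missing domain knowledge, model misspecification, or data contamination.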
-
Accelerated Sparse Recovery via Gradient Descent with Nonlinear Conjugate Gradient Momentum
Authors:
Mengqi Hu,
Yifei Lou,
Bao Wang,
Ming Yan,
Xiu Yang,
Qiang Ye
Abstract:
This paper applies an idea of adaptive momentum for the nonlinear conjugate gradient to accelerate optimization problems in sparse recovery. Specifically, we consider two types of minimization problems: a (single) differentiable function and the sum of a non-smooth function and a differentiable function. In the first case, we adopt a fixed step size to avoid the traditional line search and establish the convergence analysis of the proposed algorithm for a quadratic problem. This acceleration is further incorporated with an operator splitting technique to deal with the non-smooth function in the second case. We use the convex $\ell_1$ and the nonconvex $\ell_1-\ell_2$ functionals as two case studies to demonstrate the efficiency of the proposed approaches over traditional methods.
Submitted 5 April, 2023; v1 submitted 25 August, 2022;
originally announced August 2022.
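For readers unfamiliar with the nonconvex $\ell_1-\ell_2$ functional named in this abstract, the short sketch below evaluates $\|x\|_1 - \|x\|_2$ for a vector. It only illustrates the penalty itself, not the paper's accelerated algorithm; the function name `l1_minus_l2` is a hypothetical choice.

```python
import math
from typing import Sequence


def l1_minus_l2(x: Sequence[float]) -> float:
    """Return ||x||_1 - ||x||_2, the difference of the l1 and l2 norms."""
    l1_norm = sum(abs(v) for v in x)
    l2_norm = math.sqrt(sum(v * v for v in x))
    return l1_norm - l2_norm


# A vector with a single nonzero entry has equal l1 and l2 norms,
# so the penalty is zero; a denser vector gets a positive penalty.
one_sparse_penalty = l1_minus_l2([5.0, 0.0])
dense_penalty = l1_minus_l2([3.0, 4.0])
```

The penalty is zero for vectors with at most one nonzero entry and positive otherwise, which is one way to see why it is used to promote sparsity.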
-
Collaborative causal inference with a distributed data-sharing management
Authors:
Mengtong Hu,
Xu Shi,
Peter X. -K. Song
Abstract:
Data sharing barriers are paramount challenges arising from multicenter clinical trials where multiple data sources are stored in a distributed fashion at different local study sites. Merging such data sources into a common data storage for a centralized statistical analysis requires a data use agreement, which is often time-consuming. Data merging may become more burdensome when causal inference is of primary interest because propensity score modeling involves combining many confounding variables, and systematic incorporation of this additional modeling in meta-analysis has not been thoroughly investigated in the literature. We propose a new causal inference framework that avoids the merging of subject-level raw data from multiple sites but needs only the sharing of summary statistics. The proposed collaborative inference enjoys maximal protection of data privacy and minimal sensitivity to unbalanced data distributions across data sources. We show theoretically and numerically that the new distributed causal inference approach has little loss of statistical power compared to the centralized method that requires merging the entire data. We present large-sample properties and algorithms for the proposed method. We illustrate its performance by simulation experiments and a real-world data example on a multicenter clinical trial of basal insulin treatment for reducing the risk of post-transplantation diabetes among kidney-transplant patients.
Submitted 2 April, 2022;
originally announced April 2022.
-
Theory for identification and Inference with Synthetic Controls: A Proximal Causal Inference Framework
Authors:
Xu Shi,
Kendrick Li,
Wang Miao,
Mengtong Hu,
Eric Tchetgen Tchetgen
Abstract:
Synthetic control (SC) methods are commonly used to estimate the treatment effect on a single treated unit in panel data settings. An SC is a weighted average of control units built to match the treated unit, with weights typically estimated by regressing (summaries of) pre-treatment outcomes and measured covariates of the treated unit on those of the control units. However, it has been established that in the absence of a good fit, such a regression estimator will generally perform poorly. In this paper, we introduce a proximal causal inference framework to formalize identification and inference for both the SC and ultimately the treatment effect on the treated, based on the observation that control units not contributing to the construction of an SC can be repurposed as proxies of latent confounders. We view the difference in the post-treatment outcomes between the treated unit and the SC as a time series, which opens the door to various time series methods for treatment effect estimation. The proposed framework can accommodate nonlinear models, which allows for binary and count outcomes that are understudied in the SC literature. We illustrate with simulation studies and an application to the evaluation of the 1990 German Reunification.
Submitted 18 February, 2023; v1 submitted 31 August, 2021;
originally announced August 2021.
-
Knowledge distillation from multi-modal to mono-modal segmentation networks
Authors:
Minhao Hu,
Matthis Maillard,
Ya Zhang,
Tommaso Ciceri,
Giammarco La Barbera,
Isabelle Bloch,
Pietro Gori
Abstract:
The joint use of multiple imaging modalities for medical image segmentation has been widely studied in recent years. The fusion of information from different modalities has been shown to improve segmentation accuracy, with respect to mono-modal segmentations, in several applications. However, acquiring multiple modalities is usually not possible in a clinical setting due to a limited number of physicians and scanners, and to limit costs and scan time. Most of the time, only one modality is acquired. In this paper, we propose KD-Net, a framework to transfer knowledge from a trained multi-modal network (teacher) to a mono-modal one (student). The proposed method is an adaptation of the generalized distillation framework where the student network is trained on a subset (1 modality) of the teacher's inputs (n modalities). We illustrate the effectiveness of the proposed framework in brain tumor segmentation with the BraTS 2018 dataset. Using different architectures, we show that the student network effectively learns from the teacher and always outperforms the baseline mono-modal network in terms of segmentation accuracy.
Submitted 17 June, 2021;
originally announced June 2021.
-
A general framework of rotational sparse approximation in uncertainty quantification
Authors:
Mengqi Hu,
Yifei Lou,
Xiu Yang
Abstract:
This paper proposes a general framework to estimate coefficients of generalized polynomial chaos (gPC) used in uncertainty quantification via rotational sparse approximation. In particular, we aim to identify a rotation matrix such that the gPC expansion of a set of random variables after the rotation has a sparser representation. However, this rotational approach alters the underlying linear system to be solved, which makes finding the sparse coefficients more difficult than the case without rotation. To solve this problem, we examine several popular nonconvex regularizations in compressive sensing (CS) that perform better than the classic l1 approach empirically. All these regularizations can be minimized by the alternating direction method of multipliers (ADMM). Numerical examples show superior performance of the proposed combination of rotation and nonconvex sparse promoting regularizations over the ones without rotation and with rotation but using the convex l1 approach.
Submitted 17 September, 2021; v1 submitted 13 January, 2021;
originally announced January 2021.
-
Seasonal association between viral causes of hospitalised acute lower respiratory infections and meteorological factors in China: a retrospective study
Authors:
Bing Xu,
Jinfeng Wang,
Zhongjie Li,
Chengdong Xu,
Yilan Liao,
Maogui Hu,
Jing Yang,
Shengjie Lai,
Liping Wang,
Weizhong Yang
Abstract:
Acute lower respiratory infections (ALRI) caused by respiratory viruses are common and persistent infectious diseases worldwide and in China, and they have pronounced seasonal patterns. Meteorological factors have important roles in the seasonality of some major viruses. Our aim was to identify the dominant meteorological factors and to model their effects on common respiratory viruses in different regions of China. We analysed monthly virus data on patients from 81 sentinel hospitals in 22 provinces in mainland China from 2009 to 2013. The geographical detector method was used to quantify the explanatory power of each meteorological factor, individually and interacting in pairs. Of 28369 hospitalised patients with ALRI who were tested, 10387 were positive for at least one virus, including RSV, influenza virus, PIV, ADV, hBoV, hCoV and hMPV. RSV and influenza virus had annual peaks in the north and biannual peaks in the south. PIV and hBoV had higher positive rates in the spring and summer months. hMPV had an annual peak in winter and spring, especially in the north. ADV and hCoV exhibited no clear annual seasonality. Temperature, atmospheric pressure, vapour pressure, and rainfall had the most explanatory power for most respiratory viruses in each region. Relative humidity was only dominant in the north, but had no significant explanatory power for most viruses in the south. Hours of sunlight had significant explanatory power for RSV and influenza virus in the north, and for most viruses in the south. Wind speed was the only factor with significant explanatory power for human coronavirus in the south. For all viruses, interactions between any two of the paired factors resulted in enhanced explanatory power, either bivariately or non-linearly.
Submitted 15 April, 2021; v1 submitted 30 November, 2020;
originally announced December 2020.
-
Predicting conversions in display advertising based on URL embeddings
Authors:
Yang Qiu,
Nikolaos Tziortziotis,
Martial Hue,
Michalis Vazirgiannis
Abstract:
Online display advertising has grown rapidly in recent years thanks to the automation of the ad buying process. Real-time bidding (RTB) allows the automated trading of ad impressions between advertisers and publishers through real-time auctions. In order to increase the effectiveness of their campaigns, advertisers should deliver ads to the users who are highly likely to convert (i.e., purchase, registration, website visit, etc.) in the near future. In this study, we introduce and examine different models for estimating the probability of a user converting, given their history of visited URLs. Inspired by natural language processing, we introduce three URL embedding models to compute semantically meaningful URL representations. To demonstrate the effectiveness of the different proposed representation and conversion prediction models, we have conducted experiments on real logged events collected from an advertising platform.
Submitted 28 August, 2020; v1 submitted 27 August, 2020;
originally announced August 2020.
-
COVID-19 in a social reinsurance framework: Forewarned is forearmed
Authors:
S. Sahin,
M. C. Boado-Penas,
C. Constantinescu,
J. Eisenberg,
K. Henshaw,
M. Hu,
J. Wang,
W. Zhu
Abstract:
The crisis caused by COVID-19 revealed the global unpreparedness to handle the impact of a pandemic. In this paper, we present a statistical analysis of the data related to the COVID-19 outbreak in China, specifically the infection speed, death and fatality rates in Hubei province. By fitting distributions of these quantities we design a parametric reinsurance contract whose trigger and cap are based on the probability distributions of the infection speed, death and fatality rates. In particular, fitting the distribution for the infection speed and death rates we provide a measure of the effectiveness of a state's action during an epidemic, and propose a reinsurance contract as a supplement to a state's social insurance to alleviate financial costs.
△ Less
Submitted 9 April, 2020;
originally announced April 2020.
-
Generative adversarial networks (GAN) based efficient sampling of chemical space for inverse design of inorganic materials
Authors:
Yabo Dan,
Yong Zhao,
Xiang Li,
Shaobo Li,
Ming Hu,
Jianjun Hu
Abstract:
A major challenge in materials design is how to efficiently search the vast chemical design space to find materials with desired properties. One effective strategy is to develop sampling algorithms that can exploit both explicit chemical knowledge and implicit composition rules embodied in large materials databases. Here, we propose a generative machine learning model (MatGAN) based on a generative adversarial network (GAN) for efficient generation of new hypothetical inorganic materials. Trained with materials from the ICSD database, our GAN model can generate hypothetical materials not existing in the training dataset, reaching a novelty of 92.53% when generating 2 million samples. The percentage of chemically valid (charge-neutral and electronegativity-balanced) samples out of all generated ones reaches 84.5% when our GAN is trained with materials from ICSD, even though no such chemical rules are explicitly enforced in the model, indicating its capability to learn implicit chemical composition rules. Our algorithm could be used to speed up inverse design or computational screening of inorganic materials.
Submitted 12 November, 2019;
originally announced November 2019.
-
Exploring Bias in GAN-based Data Augmentation for Small Samples
Authors:
Mengxiao Hu,
Jinlong Li
Abstract:
For machine learning tasks, lacking sufficient samples means the trained model has low confidence in approximating the ground-truth function. Only recently, after generative adversarial networks (GANs) were proposed, did small-sample data augmentation (DA) with realistic fake data become promising, and many works have validated the viability of GAN-based DA. Although most of these works report that higher accuracy can be achieved using GAN-based DA, some researchers have stressed that the fake data generated by a GAN has inherent bias. In this paper, we explore when the bias is low enough that it does not hurt performance. We set up experiments to depict the bias in different GAN-based DA settings and, from the results, design a pipeline to inspect whether a specific dataset is efficiently augmentable with GAN-based DA. Finally, based on our attempts to reduce the bias, we offer some advice for mitigating bias in GAN-based DA applications.
Submitted 21 May, 2019;
originally announced May 2019.
-
Doubly Aligned Incomplete Multi-view Clustering
Authors:
Menglei Hu,
Songcan Chen
Abstract:
Multi-view clustering has attracted more and more attention. To date, almost all previous studies assume that views are complete. In reality, however, each view may contain some missing instances. Such incompleteness makes it impossible to directly use traditional multi-view clustering methods. In this paper, we propose a Doubly Aligned Incomplete Multi-view Clustering algorithm (DAIMC) based on weighted semi-nonnegative matrix factorization (semi-NMF). Specifically, on the one hand, DAIMC utilizes the given instance alignment information to learn a common latent feature matrix for all the views. On the other hand, DAIMC establishes a consensus basis matrix with the help of $L_{2,1}$-norm regularized regression to reduce the influence of missing instances. Consequently, compared with existing methods, besides inheriting the strength of semi-NMF in handling negative entries, DAIMC has two unique advantages: 1) it solves the incomplete view problem by introducing a respective weight matrix for each view, making it easy to adapt to cases with more than two views; 2) it reduces the influence of view incompleteness on clustering by enforcing the basis matrices of individual views to be aligned with the help of regression. Experiments on four real-world datasets demonstrate its advantages.
Submitted 7 March, 2019;
originally announced March 2019.
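The $L_{2,1}$ norm mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common row-wise convention (the sum of the Euclidean norms of a matrix's rows); the abstract does not say which convention DAIMC uses, and the function name `l21_norm` and the matrix `W` are hypothetical, chosen only for illustration. This is not the authors' implementation.

```python
import math


def l21_norm(matrix):
    """Sum of the Euclidean (L2) norms of the rows of `matrix`.

    Assumes the row-wise convention; some papers sum over columns instead.
    """
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)


# Hypothetical matrix: two non-zero rows (norms 5 and 13) and one all-zero row.
W = [[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]
print(l21_norm(W))  # 5 + 0 + 13 = 18.0
```

Because every row contributes its whole norm, using this quantity as a penalty tends to drive entire rows to zero rather than isolated entries, which is why it is often used as a row-sparsity regularizer.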
-
One-Pass Incomplete Multi-view Clustering
Authors:
Menglei Hu,
Songcan Chen
Abstract:
Real data often have multiple modalities or come from multiple heterogeneous sources, thus forming so-called multi-view data, which are receiving more and more attention in machine learning. Multi-view clustering (MVC) has become an important paradigm for such data. In real-world applications, some views often suffer from missing instances. Clustering on such multi-view datasets is called incomplete multi-view clustering (IMC) and is quite challenging. To date, though many approaches have been developed, most of them are offline and have high computational and memory costs, especially for large-scale datasets. To address this problem, we propose a One-Pass Incomplete Multi-view Clustering framework (OPIMC). With the help of regularized matrix factorization and weighted matrix factorization, OPIMC can deal with this problem relatively easily. Unlike the sole existing online IMC method, OPIMC can directly obtain clustering results and effectively determine when to terminate the iteration process by introducing two global statistics. Finally, extensive experiments conducted on four real datasets demonstrate the efficiency and effectiveness of the proposed OPIMC method.
Submitted 2 March, 2019;
originally announced March 2019.
-
Off-Policy Evaluation of Probabilistic Identity Data in Lookalike Modeling
Authors:
Randell Cotta,
Mingyang Hu,
Dan Jiang,
Peizhou Liao
Abstract:
We evaluate the impact of probabilistically-constructed digital identity data collected from Sep. to Dec. 2017 (approx.), in the context of Lookalike-targeted campaigns. The backbone of this study is a large set of probabilistically-constructed "identities", represented as small bags of cookies and mobile ad identifiers with associated metadata, that are likely all owned by the same underlying user. The identity data allows us to generate "identity-based", rather than "identifier-based", user models, giving a fuller picture of the interests of the users underlying the identifiers. We employ off-policy techniques to evaluate the potential of identity-powered lookalike models without incurring the risk of allowing untested models to direct large amounts of ad spend or the large cost of performing A/B tests. We add to historical work on off-policy evaluation by noting a significant type of "finite-sample bias" that occurs for studies combining modestly-sized datasets and evaluation metrics involving rare events (e.g., conversions). We illustrate this bias using a simulation study that later informs the handling of inverse propensity weights in our analyses on real data. We demonstrate significant lift in identity-powered lookalikes versus an identity-ignorant baseline: on average ~70% lift in conversion rate. This rises to factors of ~(4-32)x for identifiers having little data themselves, but that can be inferred to belong to users with substantial data to aggregate across identifiers. This implies that identity-powered user modeling is especially important in the context of identifiers having very short lifespans (i.e., frequently churned cookies). Our work motivates and informs the use of probabilistically-constructed identities in marketing. It also deepens the canon of examples in which off-policy learning has been employed to evaluate the complex systems of the internet economy.
Submitted 3 January, 2019;
originally announced January 2019.
-
Detection of REM Sleep Behaviour Disorder by Automated Polysomnography Analysis
Authors:
Navin Cooray,
Fernando Andreotti,
Christine Lo,
Mkael Symmonds,
Michele T. M. Hu,
Maarten De Vos
Abstract:
Evidence suggests Rapid-Eye-Movement (REM) Sleep Behaviour Disorder (RBD) is an early predictor of Parkinson's disease. This study proposes a fully-automated framework for RBD detection consisting of automated sleep staging followed by RBD identification. Analysis was assessed using a limited polysomnography montage from 53 participants with RBD and 53 age-matched healthy controls. Sleep stage classification was achieved using a Random Forest (RF) classifier and 156 features extracted from electroencephalogram (EEG), electrooculogram (EOG) and electromyogram (EMG) channels. For RBD detection, an RF classifier was trained combining established techniques to quantify muscle atonia with additional features that incorporate sleep architecture and the EMG fractal exponent. Automated multi-state sleep staging achieved a 0.62 Cohen's Kappa score. RBD detection accuracy improved by 10% to 96% (compared to individual established metrics) when using manually annotated sleep staging. Accuracy remained high (92%) when using automated sleep staging. This study outperforms established metrics and demonstrates that incorporating sleep architecture and sleep stage transitions can benefit RBD detection. This study also achieved automated sleep staging with a level of accuracy comparable to manual annotation. This study validates a tractable, fully-automated, and sensitive pipeline for RBD identification that could be translated to wearable take-home technology.
Submitted 12 November, 2018;
originally announced November 2018.
-
Neural CRF transducers for sequence labeling
Authors:
Kai Hu,
Zhijian Ou,
Min Hu,
Junlan Feng
Abstract:
Conditional random fields (CRFs) have been shown to be one of the most successful approaches to sequence labeling. Various linear-chain neural CRFs (NCRFs) have been developed to implement the non-linear node potentials in CRFs while still keeping the linear-chain hidden structure. In this paper, we propose NCRF transducers, which consist of two RNNs, one extracting features from observations and the other capturing (theoretically infinite) long-range dependencies between labels. Different sequence labeling methods are evaluated over POS tagging, chunking and NER (English, Dutch). Experimental results show that NCRF transducers achieve consistent improvements over linear-chain NCRFs and RNN transducers across all four tasks, and can improve state-of-the-art results.
Submitted 4 November, 2018;
originally announced November 2018.
-
Memristor-based Deep Convolution Neural Network: A Case Study
Authors:
Fan Zhang,
Miao Hu
Abstract:
In this paper, we first introduce a method to efficiently implement large-scale high-dimensional convolution with realistic memristor-based circuit components. An experimentally verified simulator is adapted for accurate prediction of analog crossbar behavior. An improved conversion algorithm is developed to convert convolution kernels to memristor-based circuits, which minimizes the error with consideration of the data and kernel patterns in CNNs. With circuit simulation for all convolution layers in ResNet-20, we found that 8-bit ADC/DAC is necessary to preserve software-level classification accuracy.
Submitted 14 September, 2018;
originally announced October 2018.
-
Multi-target Unsupervised Domain Adaptation without Exactly Shared Categories
Authors:
Huanhuan Yu,
Menglei Hu,
Songcan Chen
Abstract:
Unsupervised domain adaptation (UDA) aims to learn the unlabeled target domain by transferring the knowledge of the labeled source domain. To date, most existing works focus on the scenario of one source domain and one target domain (1S1T), and just a few works concern the scenario of multiple source domains and one target domain (mS1T). However, to the best of our knowledge, almost no work concerns the scenario of one source domain and multiple target domains (1SmT), in which the unlabeled target domains may not necessarily share the same categories; therefore, in contrast to mS1T, 1SmT is more challenging. Accordingly, for this new UDA scenario, we propose a UDA framework based on model parameter adaptation (PA-1SmT). A key ingredient of PA-1SmT is to transfer knowledge through adaptive learning of a common model parameter dictionary, which is completely different from existing popular methods for UDA, such as subspace alignment and distribution matching, and can also be directly used for privacy-preserving DA because the knowledge is transferred only via the model parameters rather than the data itself. Finally, our experimental results on three domain adaptation benchmark datasets demonstrate the superiority of our framework.
Submitted 17 September, 2018; v1 submitted 4 September, 2018;
originally announced September 2018.
-
Modeling Coefficient Alpha for Measurement of Individualized Test Score Internal Consistency
Authors:
Molei Liu,
Ming Hu,
Xiaohua Zhou
Abstract:
A method for measuring individualized reliability of several tests on subjects with heterogeneity is proposed. A regression model is developed based on three sets of generalized estimating equations (GEE). The first set of GEE models the expectation of the responses, the second set models the responses' variance, and the third set is proposed to estimate the individualized coefficient alpha, defined and used to measure individualized internal consistency of the responses. We also extend our method to handle missing data in the covariates. The asymptotic property of the estimators is discussed, based on which interval estimation of the coefficient alpha and significance detection are derived. Performance of our method is evaluated through a simulation study and real data analysis. The real data application is from a health literacy study in Hunan province of China.
Submitted 8 September, 2017;
originally announced September 2017.
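For readers unfamiliar with coefficient alpha, the sketch below computes the classical, population-level version (Cronbach's alpha) from a subjects-by-items score table. It is only background for the abstract above: it is not the individualized, GEE-based estimator the paper proposes, and the function name `coefficient_alpha` and the sample scores are hypothetical.

```python
from statistics import variance


def coefficient_alpha(scores):
    """Classical coefficient alpha for a table of scores.

    `scores` is a list of subjects, each a list of k item scores (k >= 2).
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(scores[0])
    # Sample variance of each item (column) across subjects.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of each subject's total score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: three subjects, two items that agree perfectly.
print(coefficient_alpha([[1, 1], [2, 2], [3, 3]]))  # 1.0
```

When the items move together perfectly, as in this toy table, alpha reaches its maximum of 1; less consistent items yield smaller values.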
-
DIMM-SC: A Dirichlet mixture model for clustering droplet-based single cell transcriptomic data
Authors:
Zhe Sun,
Ting Wang,
Ke Deng,
Xiao-Feng Wang,
Robert Lafyatis,
Ying Ding,
Ming Hu,
Wei Chen
Abstract:
Motivation: Single cell transcriptome sequencing (scRNA-Seq) has become a revolutionary tool to study cellular and molecular processes at single cell resolution. Among existing technologies, the recently developed droplet-based platform enables efficient parallel processing of thousands of single cells with direct counting of transcript copies using Unique Molecular Identifiers (UMIs). Despite the technological advances, statistical methods and computational tools are still lacking for analyzing droplet-based scRNA-Seq data. In particular, model-based approaches for clustering large-scale single cell transcriptomic data are still under-explored. Methods: We developed DIMM-SC, a Dirichlet Mixture Model for clustering droplet-based Single Cell transcriptomic data. This approach explicitly models UMI count data from scRNA-Seq experiments and characterizes variations across different cell clusters via a Dirichlet mixture prior. An expectation-maximization algorithm is used for parameter inference. Results: We performed comprehensive simulations to evaluate DIMM-SC and compared it with existing clustering methods such as K-means, CellTree and Seurat. In addition, we analyzed public scRNA-Seq datasets with known cluster labels and in-house scRNA-Seq datasets from a study of systemic sclerosis with prior biological knowledge to benchmark and validate DIMM-SC. Both simulation studies and real data applications demonstrated that, overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability compared to other existing clustering methods. More importantly, as a model-based approach, DIMM-SC is able to quantify the clustering uncertainty for each single cell, facilitating rigorous statistical inference and biological interpretations, which are typically unavailable from existing clustering methods.
Submitted 6 April, 2017;
originally announced April 2017.
-
Detection of treatment effects by covariate-adjusted expected shortfall
Authors:
Xuming He,
Ya-Hui Hsu,
Mingxiu Hu
Abstract:
The statistical tests that are commonly used for detecting mean or median treatment effects suffer from low power when the two distribution functions differ only in the upper (or lower) tail, as in the assessment of the Total Sharp Score (TSS) under different treatments for rheumatoid arthritis. In this article, we propose a more powerful test that detects treatment effects through the expected shortfalls. We show how the expected shortfall can be adjusted for covariates, and demonstrate that the proposed test can achieve a substantial sample size reduction over the conventional tests on the mean effects.
Submitted 7 January, 2011;
originally announced January 2011.
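To make the idea of an upper-tail expected shortfall concrete, here is a minimal empirical sketch. It assumes a simple definition (the average of the largest ceil((1 - tau) * n) observations) and does no covariate adjustment, so it is not the test proposed in the abstract above. The function name `upper_expected_shortfall` and both samples are hypothetical.

```python
import math
from statistics import median


def upper_expected_shortfall(values, tau):
    """Average of the largest ceil((1 - tau) * n) observations in `values`."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    ordered = sorted(values)
    # Number of observations in the upper tail; always at least one.
    k = max(1, math.ceil((1.0 - tau) * len(ordered)))
    tail = ordered[-k:]
    return sum(tail) / k


# Hypothetical samples that differ only in their two largest values.
control = [1, 2, 3, 4, 5, 6, 7, 8]
treated = [1, 2, 3, 4, 5, 6, 17, 18]

print(median(control), median(treated))          # 4.5 4.5
print(upper_expected_shortfall(control, 0.75))   # 7.5
print(upper_expected_shortfall(treated, 0.75))   # 17.5
```

The two samples share the same median, so a median comparison sees no difference, while the upper-tail averages separate them clearly. This is the kind of tail-only difference the abstract describes.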