-
Spectral clustering for dependent community Hawkes process models of temporal networks
Authors:
Lingfei Zhao,
Hadeel Soliman,
Kevin S. Xu,
Subhadeep Paul
Abstract:
Temporal networks observed continuously over time through timestamped relational events data are commonly encountered in application settings including online social media communications, financial transactions, and international relations. Temporal networks often exhibit community structure and strong dependence patterns among node pairs. This dependence can be modeled through mutual excitations, where an interaction event from a sender to a receiver node increases the possibility of future events among other node pairs.
We provide statistical results for a class of models that we call dependent community Hawkes (DCH) models, which combine the stochastic block model with mutually exciting Hawkes processes for modeling both community structure and dependence among node pairs, respectively. We derive a non-asymptotic upper bound on the misclustering error of spectral clustering on the event count matrix as a function of the number of nodes and communities, time duration, and the amount of dependence in the model. Our result leverages recent results on bounding an appropriate distance between a multivariate Hawkes process count vector and a Gaussian vector, along with results from random matrix theory. We also propose a DCH model that incorporates only self and reciprocal excitation along with highly scalable parameter estimation using a Generalized Method of Moments (GMM) estimator that we demonstrate to be consistent for growing network size and time duration.
Submitted 27 May, 2025;
originally announced May 2025.
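The abstract above analyzes spectral clustering applied to the event count matrix. A minimal sketch of that general idea follows, under loud assumptions: the toy counts are independent Poisson draws from a two-block model (not a Hawkes process), and the function name `spectral_cluster_counts`, the farthest-point initialisation, and the plain Lloyd iterations are illustrative choices rather than the authors' procedure.

```python
import numpy as np

def spectral_cluster_counts(counts, n_clusters, n_iter=50):
    """Cluster nodes from an n x n matrix of pairwise event counts."""
    # Leading left singular vectors of the (possibly asymmetric) count matrix.
    U, _, _ = np.linalg.svd(np.asarray(counts, dtype=float))
    X = U[:, :n_clusters]

    # Deterministic farthest-point initialisation of the k-means centres.
    centers = [X[0]]
    for _ in range(1, n_clusters):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)

    # Plain Lloyd iterations on the rows of the spectral embedding.
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Toy data (assumption): independent Poisson counts, higher rate within a block.
rng = np.random.default_rng(0)
n = 40
truth = np.repeat([0, 1], n // 2)
rate = np.where(truth[:, None] == truth[None, :], 6.0, 0.5)
counts = rng.poisson(rate)
np.fill_diagonal(counts, 0)  # no self-events

labels = spectral_cluster_counts(counts, n_clusters=2)
```

Cluster labels are only identified up to permutation, so any comparison with `truth` should check that each block is internally homogeneous rather than compare label values directly.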
-
Heterogeneous transfer learning for high dimensional regression with feature mismatch
Authors:
Jae Ho Chang,
Massimiliano Russo,
Subhadeep Paul
Abstract:
We consider the problem of transferring knowledge from a source, or proxy, domain to a new target domain for learning a high-dimensional regression model with possibly different features. Recently, the statistical properties of homogeneous transfer learning have been investigated. However, most homogeneous transfer and multi-task learning methods assume that the target and proxy domains have the same feature space, limiting their practical applicability. In applications, target and proxy feature spaces are frequently inherently different, for example, due to the inability to measure some variables in the target data-poor environments. Conversely, existing heterogeneous transfer learning methods do not provide statistical error guarantees, limiting their utility for scientific discovery. We propose a two-stage method that involves learning the relationship between the missing and observed features through a projection step in the proxy data and then solving a joint penalized regression optimization problem in the target data. We develop an upper bound on the method's parameter estimation risk and prediction risk, assuming that the proxy and the target domain parameters are sparsely different. Our results elucidate how estimation and prediction error depend on the complexity of the model, sample size, the extent of overlap, and correlation between matched and mismatched features.
Submitted 23 December, 2024;
originally announced December 2024.
-
The co-varying ties between networks and item responses via latent variables
Authors:
Selena Wang,
Plamena Powla,
Tracy Sweet,
Subhadeep Paul
Abstract:
Relationships among teachers are known to influence their teaching-related perceptions. We study whether and how teachers' advising relationships (networks) are related to their perceptions of satisfaction, students, and influence over educational policies, recorded as their responses to a questionnaire (item responses). We propose a novel joint model of network and item responses (JNIRM) with correlated latent variables to understand these co-varying ties. This methodology allows the analyst to test and interpret the dependence between a network and item responses. Using JNIRM, we discover that teachers' advising relationships contribute to their perceptions of satisfaction and students more often than their perceptions of influence over educational policies. In addition, we observe that the complementarity principle applies in certain schools, where teachers tend to seek advice from those who are different from them. JNIRM shows superior parameter estimation and model fit over separately modeling the network and item responses with latent variable models.
Submitted 28 September, 2024;
originally announced September 2024.
-
Embedding Network Autoregression for time series analysis and causal peer effect inference
Authors:
Jae Ho Chang,
Subhadeep Paul
Abstract:
We propose an Embedding Network Autoregressive Model for multivariate networked longitudinal data. We assume the network is generated from a latent variable model, and these unobserved variables are included in a structural peer effect model or a time series network autoregressive model as additive effects. This approach takes a unified view of two related yet fundamentally different problems: (1) modeling and predicting multivariate networked time series data and (2) causal peer influence estimation in the presence of homophily from finite time longitudinal data. Our estimation strategy comprises estimating latent variables from the observed network followed by least squares estimation of the network autoregressive model. We show that the estimated momentum and peer effect parameters are consistent and asymptotically normally distributed in setups with a growing number of network vertices (N) while considering both a growing number of time points T (for the time series problem) and finite T cases (for the peer effect problem). We allow the number of latent vectors K to grow at appropriate rates, which improves upon existing rates when such results are available for related models. Our theoretical results encompass cases both when the network is modeled with the random dot product graph model (ENAR) and a more general latent space model with both additive and multiplicative effects (AMNAR). We also develop a selection criterion when K is unknown that provably does not under-select and show that the theoretical guarantees hold with the selected number for K as well. Interestingly, even though we propose a unified model, our theoretical results find that different growth rates and restrictions on the latent vectors are needed to induce omitted variable bias in the peer effect problem and to ensure consistent estimation in the time series problem.
Submitted 23 March, 2025; v1 submitted 9 June, 2024;
originally announced June 2024.
-
Spatial autoregressive model with measurement error in covariates
Authors:
Subhadeep Paul,
Shanjukta Nath
Abstract:
The Spatial AutoRegressive model (SAR) is commonly used in studies involving spatial and network data to estimate the spatial or network peer influence and the effects of covariates on the response, taking into account the dependence among units. While the model can be efficiently estimated with a Quasi maximum likelihood approach (QMLE), the detrimental effect of covariate measurement error on the QMLE and how to remedy it is currently unknown. If covariates are measured with error, then the QMLE may not have the $\sqrt{n}$ convergence and may even be inconsistent even when a node is influenced by only a limited number of other nodes or spatial units. We develop a measurement error-corrected ML estimator (ME-QMLE) for the parameters of the SAR model when covariates are measured with error. The ME-QMLE possesses statistical consistency and asymptotic normality properties and we derive its limiting covariance. We consider two types of applications. The first is when the true covariate is imprecisely measured with replicated measurements or cannot be measured directly, and a proxy is observed instead. The second one involves including latent homophily factors estimated with error from the network for estimating peer influence. Our numerical results verify the bias correction property of the estimator and the accuracy of the standard error estimates in finite samples. We illustrate the method on two real datasets; i) peer influence in GPA for middle school students in New Jersey and ii) county-level death rates from the COVID-19 pandemic.
Submitted 6 August, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
A Mutually Exciting Latent Space Hawkes Process Model for Continuous-time Networks
Authors:
Zhipeng Huang,
Hadeel Soliman,
Subhadeep Paul,
Kevin S. Xu
Abstract:
Networks and temporal point processes serve as fundamental building blocks for modeling complex dynamic relational data in various domains. We propose the latent space Hawkes (LSH) model, a novel generative model for continuous-time networks of relational events, using a latent space representation for nodes. We model relational events between nodes using mutually exciting Hawkes processes with baseline intensities dependent upon the distances between the nodes in the latent space and sender and receiver specific effects. We demonstrate that our proposed LSH model can replicate many features observed in real temporal networks including reciprocity and transitivity, while also achieving superior prediction accuracy and providing more interpretable fits than existing models.
Submitted 6 July, 2022; v1 submitted 18 May, 2022;
originally announced May 2022.
-
The Multivariate Community Hawkes Model for Dependent Relational Events in Continuous-time Networks
Authors:
Hadeel Soliman,
Lingfei Zhao,
Zhipeng Huang,
Subhadeep Paul,
Kevin S. Xu
Abstract:
The stochastic block model (SBM) is one of the most widely used generative models for network data. Many continuous-time dynamic network models are built upon the same assumption as the SBM: edges or events between all pairs of nodes are conditionally independent given the block or community memberships, which prevents them from reproducing higher-order motifs such as triangles that are commonly observed in real networks. We propose the multivariate community Hawkes (MULCH) model, an extremely flexible community-based model for continuous-time networks that introduces dependence between node pairs using structured multivariate Hawkes processes. We fit the model using a spectral clustering and likelihood-based local refinement procedure. We find that our proposed MULCH model is far more accurate than existing models both for predictive and generative tasks.
Submitted 6 July, 2022; v1 submitted 2 May, 2022;
originally announced May 2022.
-
Identifying Peer Influence in Therapeutic Communities Adjusting for Latent Homophily
Authors:
Shanjukta Nath,
Keith Warren,
Subhadeep Paul
Abstract:
We investigate peer role model influence on successful graduation from Therapeutic Communities (TCs) for substance abuse and criminal behavior. We use data from 3 TCs that kept records of exchanges of affirmations among residents and their precise entry and exit dates, allowing us to form peer networks and define a causal effect of interest. The role model effect measures the difference in the expected outcome of a resident (ego) who can observe one of their peers graduate before the ego's exit vs not graduating. To identify peer influence in the presence of unobserved homophily in observational data, we model the network with a latent variable model. We show that our peer influence estimator is asymptotically unbiased when the unobserved latent positions are estimated from the observed network. We additionally propose a measurement error bias correction method to further reduce bias due to estimating latent positions. Our simulations show the proposed latent homophily adjustment and bias correction perform well in finite samples. We also extend the methodology to the case of binary response with a probit model. Our results indicate a positive effect of peers' graduation on residents' graduation and that it differs based on gender, race, and the definition of the role model effect. A counterfactual exercise quantifies the potential benefits of an intervention directly on the treated resident and indirectly on their peers through network propagation.
Submitted 10 June, 2024; v1 submitted 27 March, 2022;
originally announced March 2022.
-
Modelplasticity and Abductive Decision Making
Authors:
Subhadeep Mukhopadhyay
Abstract:
`All models are wrong but some are useful' (George Box 1979). But how to find those useful ones starting from an imperfect model? How to make informed data-driven decisions equipped with an imperfect model? These fundamental questions appear to be pervasive in virtually all empirical fields -- including economics, finance, marketing, healthcare, climate change, defense planning, and operations research. This article presents a modern approach (building on two core ideas: abductive thinking and the density-sharpening principle) and practical guidelines to tackle these issues in a systematic manner.
Submitted 7 March, 2023; v1 submitted 6 March, 2022;
originally announced March 2022.
-
Abductive Inference and C. S. Peirce: 150 Years Later
Authors:
Deep Mukhopadhyay
Abstract:
This paper is about two things: (i) Charles Sanders Peirce (1837-1914) -- an iconoclastic philosopher and polymath who is among the greatest of American minds. (ii) Abductive inference -- a term coined by C. S. Peirce, which he defined as "the process of forming explanatory hypotheses. It is the only logical operation which introduces any new idea."
Abductive inference and quantitative economics: Abductive inference plays a fundamental role in empirical scientific research as a tool for discovery and data analysis. Heckman and Singer (2017) strongly advocated "Economists should abduct." Arnold Zellner (2007) stressed that "much greater emphasis on reductive [abductive] inference in teaching econometrics, statistics, and economics would be desirable." But currently, there is no established theory or practical tool that allows an empirical analyst to abduct. This paper attempts to fill this gap by introducing new principles and concrete procedures to the Economics and Statistics community. I term the proposed approach the Abductive Inference Machine (AIM).
Peirce's historical experiment: In 1872, Peirce conducted a series of experiments to determine the distribution of response times to an auditory stimulus, which is widely regarded as one of the most significant statistical investigations in the history of nineteenth-century American mathematical research (Stigler, 1978). On the 150th anniversary of this historical experiment, we look back at the Peircean-style abductive inference through a modern statistical lens. Using Peirce's data, it is shown how empirical analysts can abduct in a systematic and automated manner using AIM.
Submitted 2 February, 2023; v1 submitted 15 November, 2021;
originally announced November 2021.
-
A Maximum Entropy Copula Model for Mixed Data: Representation, Estimation, and Applications
Authors:
Subhadeep Mukhopadhyay
Abstract:
A new nonparametric model of maximum-entropy (MaxEnt) copula density function is proposed, which offers the following advantages: (i) it is valid for a mixed random vector. By `mixed' we mean the method works for any combination of discrete or continuous variables in a fully automated manner; (ii) it yields a bona fide density estimate with interpretable parameters. By `bona fide' we mean the estimate is guaranteed to be a non-negative function that integrates to 1; and (iii) it plays a unifying role in our understanding of a large class of statistical methods. Our approach utilizes modern machinery of nonparametric statistics to represent and approximate the log-copula density function via the LP-Fourier transform. Several real-data examples are also provided to explore the key theoretical and practical implications of the theory.
Submitted 22 August, 2022; v1 submitted 21 August, 2021;
originally announced August 2021.
-
InfoGram and Admissible Machine Learning
Authors:
Subhadeep Mukhopadhyay
Abstract:
We have entered a new era of machine learning (ML), where the most accurate algorithm with superior predictive power may not even be deployable, unless it is admissible under the regulatory constraints. This has led to great interest in developing fair, transparent and trustworthy ML methods. The purpose of this article is to introduce a new information-theoretic learning framework (admissible machine learning) and algorithmic risk-management tools (InfoGram, L-features, ALFA-testing) that can guide an analyst to redesign off-the-shelf ML methods to be regulatory compliant, while maintaining good prediction accuracy. We have illustrated our approach using several real-data examples from financial sectors, biomedical research, marketing campaigns, and the criminal justice system.
Submitted 19 August, 2021; v1 submitted 16 August, 2021;
originally announced August 2021.
-
Density Sharpening: Principles and Applications to Discrete Data Analysis
Authors:
Subhadeep Mukhopadhyay
Abstract:
This article introduces a general statistical modeling principle called "Density Sharpening" and applies it to the analysis of discrete count data. The underlying foundation is based on a new theory of nonparametric approximation and smoothing methods for discrete distributions which play a useful role in explaining and uniting a large class of applied statistical methods. The proposed modeling framework is illustrated using several real applications, from seismology to healthcare to physics.
Submitted 21 August, 2021; v1 submitted 16 August, 2021;
originally announced August 2021.
-
Testing for the Network Small-World Property
Authors:
Kartik Lovekar,
Srijan Sengupta,
Subhadeep Paul
Abstract:
Researchers have long observed that the "small-world" property, which combines the concepts of high transitivity or clustering with a low average path length, is ubiquitous for networks obtained from a variety of disciplines, including social sciences, biology, neuroscience, and ecology. However, we find several shortcomings of the currently prevalent definition and detection methods that render the concept less powerful. First, the widely used small world coefficient metric combines high transitivity with a low average path length in a single measure that confounds the two separate aspects. We find that the value of the metric is dominated by transitivity, and in several cases, networks get flagged as "small world" solely because of their high transitivity. Second, the detection methods lack formal statistical inference. Third, the comparison is typically performed against simplistic random graph models as the baseline, which ignores well-known network characteristics and risks confounding the small world property with other network properties. We decouple the properties of high transitivity and low average path length as separate events to test for. Then we define the property as a statistical test between a suitable null hypothesis and a superimposed alternative hypothesis. We propose a parametric bootstrap test with several null hypothesis models to allow a wide range of background structures in the network. In addition to the bootstrap tests, we also propose an asymptotic test under the Erdős-Rényi null model for which we provide theoretical guarantees on the asymptotic level and power. Our theoretical results include asymptotic distributions of the clustering coefficient for various asymptotic growth rates on the probability of an edge. Applying the proposed methods to a large number of network datasets, we uncover new insights about their small-world property.
Submitted 8 October, 2024; v1 submitted 14 March, 2021;
originally announced March 2021.
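The abstract above mentions a parametric bootstrap test against a null random graph model. As a rough illustration of that general technique only, the sketch below tests for high transitivity against an Erdős-Rényi null fitted by edge density. The function names (`transitivity`, `sample_er`, `bootstrap_transitivity_pvalue`), the ring-lattice toy graph, and the choice of 200 bootstrap draws are all assumptions made here; the paper's actual tests (including the path-length component and richer null models) are not reproduced.

```python
import numpy as np

def transitivity(A):
    """Global clustering coefficient: 3 * triangles / connected triples."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    triples = (deg * (deg - 1)).sum()   # equals 2 * (number of connected triples)
    closed = np.trace(A @ A @ A)        # equals 6 * (number of triangles)
    return closed / triples if triples > 0 else 0.0

def sample_er(n, p, rng):
    """Draw a symmetric Erdos-Renyi adjacency matrix with no self-loops."""
    upper = np.triu(rng.random((n, n)) < p, k=1)
    return (upper | upper.T).astype(int)

def bootstrap_transitivity_pvalue(A, n_boot=200, seed=0):
    """One-sided p-value for 'transitivity is higher than under the ER null'."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    p_hat = A.sum() / (n * (n - 1))     # fitted edge probability (A is symmetric)
    observed = transitivity(A)
    null = np.array([transitivity(sample_er(n, p_hat, rng)) for _ in range(n_boot)])
    return (1 + np.sum(null >= observed)) / (n_boot + 1)

# Toy graph (assumption): ring lattice, each node linked to its 3 nearest
# neighbours on either side, which has many closed triangles.
n, k = 60, 3
A = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(1, k + 1):
        A[i, (i + j) % n] = A[(i + j) % n, i] = 1

p_value = bootstrap_transitivity_pvalue(A)
```

Adding one to both numerator and denominator keeps the bootstrap p-value strictly positive, a common convention for Monte Carlo tests.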
-
O(1) Communication for Distributed SGD through Two-Level Gradient Averaging
Authors:
Subhadeep Bhattacharya,
Weikuan Yu,
Fahim Tahmid Chowdhury
Abstract:
Large neural network models present a hefty communication challenge to distributed Stochastic Gradient Descent (SGD), with a communication complexity of O(n) per worker for a model of n parameters. Many sparsification and quantization techniques have been proposed to compress the gradients, some reducing the communication complexity to O(k), where k << n. In this paper, we introduce a strategy called two-level gradient averaging (A2SGD) to consolidate all gradients down to merely two local averages per worker before the computation of two global averages for an updated model. A2SGD also retains local errors to maintain the variance for fast convergence. Our theoretical analysis shows that A2SGD converges similarly to the default distributed SGD algorithm. Our evaluation validates the theoretical conclusion and demonstrates that A2SGD significantly reduces the communication traffic per worker, and improves the overall training time of LSTM-PTB by 3.2x and 23.2x, respectively, compared to Top-K and QSGD. To the best of our knowledge, A2SGD is the first to achieve O(1) communication complexity per worker for distributed SGD.
Submitted 15 June, 2020; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Breiman's "Two Cultures" Revisited and Reconciled
Authors:
Subhadeep Mukhopadhyay,
Kaijun Wang
Abstract:
In a landmark paper published in 2001, Leo Breiman described the tense standoff between two cultures of data modeling: parametric statistical and algorithmic machine learning. The cultural division between these two statistical learning frameworks has been growing at a steady pace in recent years. What is the way forward? It has become blatantly obvious that this widening gap between "the two cultures" cannot be averted unless we find a way to blend them into a coherent whole. This article presents a solution by establishing a link between the two cultures. Through examples, we describe the challenges and potential gains of this new integrated statistical thinking.
Submitted 27 May, 2020;
originally announced May 2020.
-
On The Problem of Relevance in Statistical Inference
Authors:
Subhadeep Mukhopadhyay,
Kaijun Wang
Abstract:
This paper is dedicated to the "50 Years of the Relevance Problem" - a long-neglected topic that begs attention from practical statisticians who are concerned with the problem of drawing inference from large-scale heterogeneous data.
Submitted 4 May, 2021; v1 submitted 20 April, 2020;
originally announced April 2020.
-
Nonparametric Universal Copula Modeling
Authors:
Subhadeep Mukhopadhyay,
Emanuel Parzen
Abstract:
To handle the ubiquitous problem of "dependence learning," copulas are quickly becoming a pervasive tool across a wide range of data-driven disciplines encompassing neuroscience, finance, econometrics, genomics, social science, machine learning, healthcare and many more. Copula (or connection) functions were invented in 1959 by Abe Sklar in response to a query of Maurice Frechet. After 60 years, where do we stand now? This article provides a history of the key developments and offers a unified perspective.
Submitted 11 December, 2019;
originally announced December 2019.
-
Attentive Modality Hopping Mechanism for Speech Emotion Recognition
Authors:
Seunghyun Yoon,
Subhadeep Dey,
Hwanhee Lee,
Kyomin Jung
Abstract:
In this work, we explore the impact of the visual modality in addition to speech and text for improving the accuracy of the emotion detection system. The traditional approaches tackle this task by fusing the knowledge from the various modalities independently for performing emotion classification. In contrast to these approaches, we tackle the problem by introducing an attention mechanism to combine the information. In this regard, we first apply a neural network to obtain hidden representations of the modalities. Then, the attention mechanism is defined to select and aggregate important parts of the video data by conditioning on the audio and text data. Furthermore, the attention mechanism is again applied to attend to important parts of the speech and textual data by considering the other modalities. Experiments are performed on the standard IEMOCAP dataset using all three modalities (audio, text, and video). The achieved results show a significant improvement of 3.65% in terms of weighted accuracy compared to the baseline system.
Submitted 22 April, 2020; v1 submitted 29 November, 2019;
originally announced December 2019.
-
Joint Latent Space Model for Social Networks with Multivariate Attributes
Authors:
Selena Shuo Wang,
Subhadeep Paul,
Paul De Boeck
Abstract:
In many application problems in social, behavioral, and economic sciences, researchers often have data on a social network among a group of individuals along with high dimensional multivariate measurements for each individual. To analyze such networked data structures, we propose a joint Attribute and Person Latent Space Model (APLSM) that summarizes information from the social network and the multiple attribute measurements in a person-attribute joint latent space. We develop a Variational Bayesian Expectation-Maximization estimation algorithm to estimate the posterior distribution of the attribute and person locations in the joint latent space. This methodology allows for effective integration, informative visualization, and prediction of social networks and high dimensional attribute measurements. Using APLSM, we explore the inner workings of the French financial elites based on their social networks and their career, political views, and social status. We observe a division in the social circles of the French elites in accordance with the differences in their individual characteristics.
Submitted 1 February, 2021; v1 submitted 26 October, 2019;
originally announced October 2019.
-
CHIP: A Hawkes Process Model for Continuous-time Networks with Scalable and Consistent Estimation
Authors:
Makan Arastuie,
Subhadeep Paul,
Kevin S. Xu
Abstract:
In many application settings involving networks, such as messages between users of an on-line social network or transactions between traders in financial markets, the observed data consist of timestamped relational events, which form a continuous-time network. We propose the Community Hawkes Independent Pairs (CHIP) generative model for such networks. We show that applying spectral clustering to an aggregated adjacency matrix constructed from the CHIP model provides consistent community detection for a growing number of nodes and time duration. We also develop consistent and computationally efficient estimators for the model parameters. We demonstrate that our proposed CHIP model and estimation procedure scale to large networks with tens of thousands of nodes and provide better fits than existing continuous-time network models on several real networks.
Submitted 10 November, 2020; v1 submitted 19 August, 2019;
originally announced August 2019.
-
Exploiting semi-supervised training through a dropout regularization in end-to-end speech recognition
Authors:
Subhadeep Dey,
Petr Motlicek,
Trung Bui,
Franck Dernoncourt
Abstract:
In this paper, we explore various approaches for semi-supervised learning in an end-to-end automatic speech recognition (ASR) framework. The first step in our approach involves training a seed model on the limited amount of labelled data. Additional unlabelled speech data is employed through a data selection mechanism to obtain the best hypothesized output, further used to retrain the seed model. However, uncertainties of the model may not be well captured with a single hypothesis. As opposed to this technique, we apply a dropout mechanism to capture the uncertainty by obtaining multiple hypothesized text transcripts of a speech recording. We assume that the diversity of automatically generated transcripts for an utterance will implicitly increase the reliability of the model. Finally, the data selection process is also applied on these hypothesized transcripts to reduce the uncertainty. Experiments on the freely available TEDLIUM corpus and Adobe's proprietary internal dataset show that the proposed approach significantly reduces ASR errors, compared to the baseline model.
Submitted 8 August, 2019;
originally announced August 2019.
-
Spectral Graph Analysis: A Unified Explanation and Modern Perspectives
Authors:
Subhadeep Mukhopadhyay,
Kaijun Wang
Abstract:
Complex networks or graphs are ubiquitous in sciences and engineering: biological networks, brain networks, transportation networks, social networks, and the World Wide Web, to name a few. Spectral graph theory provides a set of useful techniques and models for understanding `patterns of interconnectedness' in a graph. Our prime focus in this paper is on the following question: Is there a unified explanation and description of the fundamental spectral graph methods? There are at least two reasons to be interested in this question. Firstly, to gain a much deeper and refined understanding of the basic foundational principles, and secondly, to derive rich consequences with practical significance for algorithm design. However, despite half a century of research, this question remains one of the most formidable open issues, if not the core problem in modern network science. The achievement of this paper is to take a step towards answering this question by discovering a simple, yet universal statistical logic of spectral graph analysis. The prescribed viewpoint appears to be good enough to accommodate almost all existing spectral graph techniques as a consequence of just one single formalism and algorithm.
Submitted 21 January, 2019;
originally announced January 2019.
-
Higher-Order Spectral Clustering under Superimposed Stochastic Block Model
Authors:
Subhadeep Paul,
Olgica Milenkovic,
Yuguo Chen
Abstract:
Higher-order motif structures and multi-vertex interactions are becoming increasingly important in studies that aim to improve our understanding of functionalities and evolution patterns of networks. To elucidate the role of higher-order structures in community detection problems over complex networks, we introduce the notion of a Superimposed Stochastic Block Model (SupSBM). The model is based on a random graph framework in which certain higher-order structures or subgraphs are generated through an independent hyperedge generation process, and are then replaced with graphs that are superimposed with directed or undirected edges generated by an inhomogeneous random graph model. Consequently, the model introduces controlled dependencies between edges which allow for capturing more realistic network phenomena, namely strong local clustering in a sparse network, short average path length, and community structure. We proceed to rigorously analyze the performance of a number of recently proposed higher-order spectral clustering methods on the SupSBM. In particular, we prove non-asymptotic upper bounds on the misclustering error of spectral community detection for a SupSBM setting in which triangles or 3-uniform hyperedges are superimposed with undirected edges. As part of our analysis, we also derive new bounds on the misclustering error of higher-order spectral clustering methods for the standard SBM and the 3-uniform hypergraph SBM. Furthermore, for a non-uniform hypergraph SBM model in which one directly observes both edges and 3-uniform hyperedges, we obtain a criterion that describes when to perform spectral clustering based on edges and when on hyperedges, based on a function of hyperedge density and observation quality.
Submitted 16 December, 2018;
originally announced December 2018.
-
A Nonparametric Approach to High-dimensional k-sample Comparison Problems
Authors:
Subhadeep Mukhopadhyay,
Kaijun Wang
Abstract:
High-dimensional k-sample comparison is a common applied problem. We construct a class of easy-to-implement nonparametric distribution-free tests based on new tools and unexplored connections with spectral graph theory. The test is shown to possess various desirable properties along with a characteristic exploratory flavor that has practical consequences. The numerical examples show that our method works surprisingly well under a broad range of realistic situations.
Submitted 8 August, 2019; v1 submitted 3 October, 2018;
originally announced October 2018.
-
A random effects stochastic block model for joint community detection in multiple networks with applications to neuroimaging
Authors:
Subhadeep Paul,
Yuguo Chen
Abstract:
Motivated by multi-subject experiments in neuroimaging studies, we develop a modeling framework for joint community detection in a group of related networks, which can be considered as a sample from a population of networks. The proposed random effects stochastic block model facilitates the study of group differences and subject-specific variations in the community structure. The model proposes a putative mean community structure which is representative of the group or the population under consideration but is not the community structure of any individual component network. Instead, the community memberships of nodes vary in each component network with a transition matrix, thus modeling the variation in community structure across a group of subjects. To estimate the quantities of interest we propose two methods, a variational EM algorithm, and a model-free "two-step" method based on either spectral or non-negative matrix factorization (NMF). Our NMF based method Co-OSNTF is of independent interest and we study its convergence properties to a stationary point. We also develop a resampling-based hypothesis test for differences in community structure in two populations both at the whole network level and node level. The methodology is applied to a publicly available fMRI dataset from multi-subject experiments involving schizophrenia patients. Our methods reveal an overall putative community structure representative of the group as well as subject-specific variations within each group. Using our network level hypothesis tests we are able to ascertain statistically significant difference in community structure between the two groups, while our node level tests help determine the nodes that are driving the difference.
Submitted 21 March, 2020; v1 submitted 6 May, 2018;
originally announced May 2018.
-
Decentralized Nonparametric Multiple Testing
Authors:
Subhadeep Mukhopadhyay
Abstract:
Consider a big data multiple testing task, where, due to storage and computational bottlenecks, one is given a very large collection of p-values by splitting into manageable chunks and distributing over thousands of computer nodes. This paper is concerned with the following question: How can we find the full data multiple testing solution by operating completely independently on individual machines in parallel, without any data exchange between nodes? This version of the problem tends naturally to arise in a wide range of data-intensive science and industry applications whose methodological solution has not appeared in the literature to date; therefore, we feel it is necessary to undertake such analysis. Based on the nonparametric functional statistical viewpoint of large-scale inference, started in Mukhopadhyay (2016), this paper furnishes a new computing model that brings unexpected simplicity to the design of the algorithm which might otherwise seem daunting using classical approach and notations.
Submitted 5 May, 2018;
originally announced May 2018.
-
Fast Counting in Machine Learning Applications
Authors:
Subhadeep Karan,
Matthew Eichhorn,
Blake Hurlburt,
Grant Iraci,
Jaroslaw Zola
Abstract:
We propose scalable methods to execute counting queries in machine learning applications. To achieve memory and computational efficiency, we abstract counting queries and their context such that the counts can be aggregated as a stream. We demonstrate performance and scalability of the resulting approach on random queries, and through extensive experimentation using Bayesian networks learning and association rule mining. Our methods significantly outperform commonly used ADtrees and hash tables, and are practical alternatives for processing large-scale data.
Submitted 7 January, 2019; v1 submitted 12 April, 2018;
originally announced April 2018.
-
Bayesian Modeling via Goodness-of-fit
Authors:
Subhadeep Mukhopadhyay,
Douglas Fletcher
Abstract:
The two key issues of modern Bayesian statistics are: (i) establishing a principled approach for distilling a statistical prior that is consistent with the given data from an initial believable scientific prior; and (ii) development of a Bayes-frequentist consolidated data analysis workflow that is more effective than either of the two separately. In this paper, we propose the idea of "Bayes via goodness of fit" as a framework for exploring these fundamental questions, in a way that is general enough to embrace almost all of the familiar probability models. Several illustrative examples show the benefit of this new point of view as a practical data analysis tool. The relationship with other Bayesian cultures is also discussed.
Submitted 16 April, 2018; v1 submitted 1 February, 2018;
originally announced February 2018.
-
Statistics Educational Challenge in the 21st Century
Authors:
Subhadeep Mukhopadhyay
Abstract:
What do we teach and what should we teach? An honest answer to this question is painful, very painful--what we teach lags decades behind what we practice. How can we reduce this `gap' to prepare a data science workforce of trained next-generation statisticians? This is a challenging open problem that requires many well-thought-out experiments before finding the secret sauce. My goal in this article is to lay out some basic principles and guidelines (rather than creating a pseudo-curriculum based on cherry-picked topics) to expedite this process for finding an `objective' solution.
Submitted 14 August, 2017;
originally announced August 2017.
-
Spectral and matrix factorization methods for consistent community detection in multi-layer networks
Authors:
Subhadeep Paul,
Yuguo Chen
Abstract:
We consider the problem of estimating a consensus community structure by combining information from multiple layers of a multi-layer network using methods based on the spectral clustering or a low-rank matrix factorization. As a general theme, these "intermediate fusion" methods involve obtaining a low column rank matrix by optimizing an objective function and then using the columns of the matrix for clustering. However, the theoretical properties of these methods remain largely unexplored. In the absence of statistical guarantees on the objective functions, it is difficult to determine if the algorithms optimizing the objectives will return good community structures. We investigate the consistency properties of the global optimizer of some of these objective functions under the multi-layer stochastic blockmodel. For this purpose, we derive several new asymptotic results showing consistency of the intermediate fusion techniques along with the spectral clustering of mean adjacency matrix under a high dimensional setup, where the number of nodes, the number of layers and the number of communities of the multi-layer graph grow. Our numerical study shows that the intermediate fusion techniques outperform late fusion methods, namely spectral clustering on aggregate spectral kernel and module allegiance matrix in sparse networks, while they outperform the spectral clustering of mean adjacency matrix in multi-layer networks that contain layers with both homophilic and heterophilic communities.
Submitted 3 December, 2018; v1 submitted 24 April, 2017;
originally announced April 2017.
-
Null Models and Community Detection in Multi-Layer Networks
Authors:
Subhadeep Paul,
Yuguo Chen
Abstract:
Multi-layer networks are networks on a set of entities (nodes) with multiple types of relations (edges) among them where each type of relation/interaction is represented as a network layer. As with single layer networks, community detection is an important task in multi-layer networks. A large group of popular community detection methods in networks are based on optimizing a quality function known as the modularity score, which is a measure of presence of modules or communities in networks. Hence a first step in community detection is defining a suitable modularity score that is appropriate for the network in question. Here we introduce several multi-layer network modularity measures under different null models of the network, motivated by empirical observations in networks from a diverse field of applications. In particular we define the multi-layer configuration model, the multi-layer expected degree model and their various modifications as null models for multi-layer networks to derive different modularities. The proposed modularities are grouped into two categories. The first category, which is based on degree corrected multi-layer stochastic block model, has the multi-layer expected degree model as their null model. The second category, which is based on multi-layer extensions of Newman-Girvan modularity, has the multi-layer configuration model as their null model. These measures are then optimized to detect the optimal community assignment of nodes. We compare the effectiveness of the measures in community detection in simulated networks and then apply them to four real networks.
Submitted 9 December, 2020; v1 submitted 1 August, 2016;
originally announced August 2016.
-
Orthogonal symmetric non-negative matrix factorization under the stochastic block model
Authors:
Subhadeep Paul,
Yuguo Chen
Abstract:
We present a method based on the orthogonal symmetric non-negative matrix tri-factorization of the normalized Laplacian matrix for community detection in complex networks. While the exact factorization of a given order may not exist and is NP hard to compute, we obtain an approximate factorization by solving an optimization problem. We establish the connection of the factors obtained through the factorization to a non-negative basis of an invariant subspace of the estimated matrix, drawing parallel with the spectral clustering. Using such factorization for clustering in networks is motivated by analyzing a block-diagonal Laplacian matrix with the blocks representing the connected components of a graph. The method is shown to be consistent for community detection in graphs generated from the stochastic block model and the degree corrected stochastic block model. Simulation results and real data analysis show the effectiveness of these methods under a wide variety of situations, including sparse and highly heterogeneous graphs where the usual spectral clustering is known to fail. Our method also performs better than the state of the art in popular benchmark network datasets, e.g., the political web blogs and the karate club data.
Submitted 17 May, 2016;
originally announced May 2016.
-
Unified Statistical Theory of Spectral Graph Analysis
Authors:
Subhadeep Mukhopadhyay
Abstract:
The goal of this paper is to show that there exists a simple, yet universal statistical logic of spectral graph analysis by recasting it into a nonparametric function estimation problem. The prescribed viewpoint appears to be good enough to accommodate most of the existing spectral graph techniques as a consequence of just one single formalism and algorithm.
Submitted 20 September, 2016; v1 submitted 11 February, 2016;
originally announced February 2016.
-
Large-Scale Mode Identification and Data-Driven Sciences
Authors:
Subhadeep Mukhopadhyay
Abstract:
Bump-hunting or mode identification is a fundamental problem that arises in almost every scientific field of data-driven discovery. Surprisingly, very few data modeling tools are available for automatic (not requiring manual case-by-case investigation), objective (not subjective), and nonparametric (not based on restrictive parametric model assumptions) mode discovery, which can scale to large data sets. This article introduces LPMode--an algorithm based on a new theory for detecting multimodality of a probability density. We apply LPMode to answer important research questions arising in various fields from environmental science, ecology, econometrics, analytical chemistry to astronomy and cancer genomics.
Submitted 8 November, 2016; v1 submitted 21 September, 2015;
originally announced September 2015.
-
Nonparametric Distributed Learning Architecture for Big Data: Algorithm and Applications
Authors:
Scott Bruce,
Zeda Li,
Hsiang-Chieh Yang,
Subhadeep Mukhopadhyay
Abstract:
Dramatic increases in the size and complexity of modern datasets have made traditional "centralized" statistical inference prohibitive. In addition to computational challenges associated with big data learning, the presence of numerous data types (e.g. discrete, continuous, categorical, etc.) makes automation and scalability difficult. A question of immediate concern is how to design a data-intensive statistical inference architecture without changing the basic statistical modeling principles developed for "small" data over the last century. To address this problem, we present MetaLP, a flexible, distributed statistical modeling framework.
Submitted 26 February, 2018; v1 submitted 15 August, 2015;
originally announced August 2015.
-
Large Scale Signal Detection: A Unified Perspective
Authors:
Subhadeep Mukhopadhyay
Abstract:
There is an overwhelmingly large literature and algorithms already available on `large scale inference problems' based on different modeling techniques and cultures. Our primary goal in this paper is \emph{not to add one more new methodology} to the existing toolbox but instead (a) to clarify the mystery how these different simultaneous inference methods are \emph{connected}, (b) to provide an alternative more intuitive derivation of the formulas that leads to \emph{simpler} expressions, and (c) to develop a \emph{unified} algorithm for practitioners. A detailed discussion on representation, estimation, inference, and model selection is given. Applications to a variety of real and simulated datasets show promise. We end with several future research directions.
Submitted 31 March, 2017; v1 submitted 30 July, 2015;
originally announced July 2015.
-
Community detection in multi-relational data with restricted multi-layer stochastic blockmodel
Authors:
Subhadeep Paul,
Yuguo Chen
Abstract:
In recent years there has been an increased interest in statistical analysis of data with multiple types of relations among a set of entities. Such multi-relational data can be represented as multi-layer graphs where the set of vertices represents the entities and multiple types of edges represent the different relations among them. For community detection in multi-layer graphs, we consider two random graph models, the multi-layer stochastic blockmodel (MLSBM) and a model with a restricted parameter space, the restricted multi-layer stochastic blockmodel (RMLSBM). We derive consistency results for community assignments of the maximum likelihood estimators (MLEs) in both models where MLSBM is assumed to be the true model, and either the number of nodes or the number of types of edges or both grow. We compare MLEs in the two models with other baseline approaches, such as separate modeling of layers, aggregating the layers and majority voting. RMLSBM is shown to have advantage over MLSBM when either the growth rate of the number of communities is high or the growth rate of the average degree of the component graphs in the multi-graph is low. We also derive minimax rates of error and sharp thresholds for achieving consistency of community detection in both models, which are then used to compare the multi-layer models with a baseline model, the aggregate stochastic block model. The simulation studies and real data applications confirm the superior performance of the multi-layer approaches in comparison to the baseline procedures.
Submitted 21 January, 2016; v1 submitted 8 June, 2015;
originally announced June 2015.
-
Strength of Connections in a Random Graph: Definition, Characterization, and Estimation
Authors:
Subhadeep Mukhopadhyay
Abstract:
How can the `affinity' or `strength' of ties of a random graph be characterized and compactly represented? How can Fourier-like and inverse-Fourier-like transforms be developed for graph data? To do so, we introduce a new graph-theoretic function called the `Graph Correlation Density Field' (GraField for short), which differs from the traditional edge probability density-based approaches, to completely characterize tie-strength between graph nodes. Our approach further allows frequency domain analysis, applicable to both directed and undirected random graphs.
Submitted 9 December, 2015; v1 submitted 3 December, 2014;
originally announced December 2014.
-
LP Approach to Statistical Modeling
Authors:
Subhadeep Mukhopadhyay,
Emanuel Parzen
Abstract:
We present an approach to statistical data modeling and exploratory data analysis called `LP Statistical Data Science.' It aims to generalize and unify traditional and novel statistical measures, methods, and exploratory tools. This article outlines fundamental concepts along with real-data examples to illustrate how the `LP Statistical Algorithm' can systematically tackle different varieties of data types, data patterns, and data structures under a coherent theoretical framework. A fundamental role is played by a specially designed orthonormal basis of a random variable X for the linear (Hilbert space theory) representation of a general function of X, such as $\mbox{E}[Y \mid X]$.
Submitted 11 May, 2014;
originally announced May 2014.
-
LP Mixed Data Science : Outline of Theory
Authors:
Emanuel Parzen,
Subhadeep Mukhopadhyay
Abstract:
This article presents the theoretical foundation of a new frontier of research, `LP Mixed Data Science', that simultaneously extends and integrates the practice of traditional and novel statistical methods for nonparametric exploratory data modeling, and is applicable to the teaching and training of statistics.
Statistics journals have great difficulty accepting papers unlike those previously published. For statisticians with new big ideas, a practical strategy is to publish them in many small applied studies, which enables one to provide references to the work of others. This essay outlines the many concepts, new theory, and important algorithms of our new culture of statistical science called LP MIXED DATA SCIENCE. It provides comprehensive solutions to problems of data analysis and nonparametric modeling of many variables that are continuous or discrete, a topic that does not yet have a large literature. It develops a new modeling approach to nonparametric estimation of the multivariate copula density. We discuss the theory, which we believe is very elegant (and can provide a framework for United Statistical Algorithms, for traditional Small Data methods and Big Data methods).
Submitted 6 November, 2013; v1 submitted 3 November, 2013;
originally announced November 2013.
-
CDfdr: A Comparison Density Approach to Local False Discovery Rate Estimation
Authors:
Subhadeep Mukhopadhyay
Abstract:
Efron et al. (2001) proposed an empirical Bayes formulation of the frequentist Benjamini and Hochberg False Discovery Rate method (Benjamini and Hochberg, 1995). This article attempts to unify the `two cultures' using the concepts of comparison density and distribution function. We also show how almost all of the existing local fdr methods can be viewed as proposing various model specifications for the comparison density, which unifies the vast literature of false discovery methods under one concept and notation.
Submitted 11 August, 2013;
originally announced August 2013.
-
Nonlinear Time Series Modeling: A Unified Perspective, Algorithm, and Application
Authors:
Subhadeep Mukhopadhyay,
Emanuel Parzen
Abstract:
A new comprehensive approach to nonlinear time series analysis and modeling is developed in the present paper. We introduce novel data-specific mid-distribution based Legendre Polynomial (LP) like nonlinear transformations of the original time series Y(t) that enable us to adapt all the existing stationary linear Gaussian time series modeling strategies and make them applicable to non-Gaussian and nonlinear processes in a robust fashion. The emphasis of the present paper is on empirical time series modeling via the algorithm LPTime. We demonstrate the effectiveness of our theoretical framework using daily S&P 500 return data between Jan/2/1963 - Dec/31/2009. Our proposed LPTime algorithm systematically and automatically discovers, all at once, the `stylized facts' of financial time series that were previously noted by many researchers one at a time.
Submitted 23 December, 2017; v1 submitted 2 August, 2013;
originally announced August 2013.
-
United Statistical Algorithm, Small and Big Data: Future OF Statistician
Authors:
Emanuel Parzen,
Subhadeep Mukhopadhyay
Abstract:
This article describes the role of big idea statisticians in the future of Big Data Science. We describe the `United Statistical Algorithms' framework for comprehensive unification of traditional and novel statistical methods for modeling Small Data and Big Data, especially mixed data (discrete, continuous).
Submitted 2 August, 2013;
originally announced August 2013.
-
Modeling, dependence, classification, united statistical science, many cultures
Authors:
Emanuel Parzen,
Subhadeep Mukhopadhyay
Abstract:
Breiman (2001) urged statisticians to be aware of two cultures: 1. Parametric modeling culture, pioneered by R. A. Fisher and Jerzy Neyman; 2. Algorithmic predictive culture, pioneered by machine learning research.
Parzen (2001), as a part of discussing Breiman (2001), proposed that researchers be aware of many cultures, including the focus of our research: 3. Nonparametric, quantile based, information theoretic modeling. We provide a unification of many statistical methods for traditional small data sets and emerging big data sets in terms of comparison density, copula density, measure of dependence, correlation, information, and new measures (called LP score comoments) that apply to long tailed distributions without finite second order moments. A very important goal is to unify methods for discrete and continuous random variables. Our research extends these methods to modern high dimensional data modeling.
Submitted 23 April, 2012; v1 submitted 20 April, 2012;
originally announced April 2012.
-
Quantile Based Variable Mining : Detection, FDR based Extraction and Interpretation
Authors:
S. Mukhopadhyay,
Emanuel Parzen,
S. N. Lahiri
Abstract:
This paper outlines a unified framework for high dimensional variable selection for classification problems. Traditional approaches to finding interesting variables mostly utilize only partial information through moments (like mean difference). On the contrary, in this paper we address the question of variable selection in full generality from a distributional point of view. If a variable is not important for classification, then it will have similar distributional aspects under different classes. This simple and straightforward observation motivates us to quantify `How and Why' the distribution of a variable changes over classes through the CR-statistic. The second contribution of our paper is to develop and investigate the FDR based thresholding technology from a completely new point of view for adaptive thresholding, which leads to an elegant algorithm called CDfdr. This paper attempts to show how all of these problems of detection, extraction and interpretation for interesting variables can be treated in a unified way under one broad general theme: comparison analysis. It is proposed that a key to accomplishing this unification is to think in terms of the quantile function and the comparison density. We illustrate and demonstrate the power of our methodology using three real data sets.
Submitted 14 December, 2011;
originally announced December 2011.