AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science

Authors: An Luo, Xun Xian, Jin Du, Fangqiao Tian, Ganghua Wang, Ming Zhong, Shengchun Zhao, Xuan Bi, Zirui Liu, Jiawei Zhou, Jayanth Srinivasa, Ashish Kundu, Charles Fleming, Mingyi Hong, Jie Ding

Abstract: Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results demonstrate three key findings: (1) LLMs frequently exhibit an uncritical adoption of provided information, significantly impairing their predictive performance when adversarial content is introduced, (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information, and (3) in Kaggle datasets, LLMs often make errors in handling time-series data, applying consistent feature engineering across different folds, and interpreting categorical variables correctly.
These findings highlight a substantial gap in current models' ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems.

Submitted 25 May, 2025; originally announced June 2025.

MSC Class: 62-07; 62-08; 68T05; 68T07; 68T01; 68T50 ACM Class: I.2.0; I.2.6; I.2.7; I.5.1; I.5.4; H.2.8; G.3

arXiv:2503.01077 [pdf, other]

Learning Stochastic Dynamical Systems with Structured Noise

Authors: Ziheng Guo, James Greene, Ming Zhong

Abstract: Stochastic differential equations (SDEs) are a ubiquitous modeling framework that finds applications in physics, biology, engineering, social science, and finance. Due to the availability of large-scale data sets, there is growing interest in learning mechanistic models from observations with stochastic noise. In this work, we present a nonparametric framework to learn both the drift and diffusion terms in systems of SDEs where the stochastic noise is singular. Specifically, inspired by second-order equations from classical physics, we consider systems which possess structured noise, i.e. noise with a singular covariance matrix. We provide an algorithm for constructing estimators given trajectory data and demonstrate the effectiveness of our methods via a number of examples from physics and biology. As the developed framework is most naturally applicable to systems possessing a high degree of dimensionality reduction (i.e. symmetry), we also apply it to the high dimensional Cucker-Smale flocking model studied in collective dynamics and show that it is able to accurately infer the low dimensional interaction kernel from particle data.

Submitted 2 March, 2025; originally announced March 2025.

arXiv:2406.08335 [pdf, other]

A Survey of Pipeline Tools for Data Engineering

Authors: Anthony Mbata, Yaji Sripada, Mingjun Zhong

Abstract: Currently, a variety of pipeline tools are available for use in data engineering. Data scientists can use these tools to resolve data wrangling issues associated with data and accomplish some data engineering tasks from data ingestion through data preparation to utilization as input for machine learning (ML). Some of these tools have essential built-in components or can be combined with other tools to perform desired data engineering operations. While some tools are wholly or partly commercial, several open-source tools are available to perform expert-level data engineering tasks. This survey examines the broad categories and examples of pipeline tools based on their design and data engineering intentions. These categories are Extract Transform Load/Extract Load Transform (ETL/ELT), pipelines for Data Integration, Ingestion, and Transformation, Data Pipeline Orchestration and Workflow Management, and Machine Learning Pipelines. The survey also provides a broad outline of the utilization with examples within these broad groups and finally, a discussion is presented with case studies indicating the usage of pipeline tools for data engineering. The studies present some first-user application experiences with sample data, some complexities of the applied pipeline, and a summary note of approaches to using these tools to prepare data for machine learning.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 18 pages, 7 figures

arXiv:2010.03729 [pdf, other]

Learning Theory for Inferring Interaction Kernels in Second-Order Interacting Agent Systems

Authors: Jason Miller, Sui Tang, Ming Zhong, Mauro Maggioni

Abstract: Modeling the complex interactions of systems of particles or agents is a fundamental scientific and mathematical problem that is studied in diverse fields, ranging from physics and biology, to economics and machine learning. In this work, we describe a very general second-order, heterogeneous, multivariable, interacting agent model, with an environment, that encompasses a wide variety of known systems. We describe an inference framework that uses nonparametric regression and approximation theory based techniques to efficiently derive estimators of the interaction kernels which drive these dynamical systems. We develop a complete learning theory which establishes strong consistency and optimal nonparametric min-max rates of convergence for the estimators, as well as provably accurate predicted trajectories. The estimators exploit the structure of the equations in order to overcome the curse of dimensionality and we describe a fundamental coercivity condition on the inverse problem which ensures that the kernels can be learned and relates to the minimal singular value of the learning matrix. The numerical algorithm presented to build the estimators is parallelizable, performs well on high-dimensional problems, and is demonstrated on complex dynamical systems.

Submitted 7 October, 2020; originally announced October 2020.

Comments: 68 pages

MSC Class: 62Gxx; 37Nxx; 68Txx

arXiv:1912.11123 [pdf, other]

Data-driven Discovery of Emergent Behaviors in Collective Dynamics

Authors: Mauro Maggioni, Jason Miller, Ming Zhong

Abstract: Particle- and agent-based systems are a ubiquitous modeling tool in many disciplines. We consider the fundamental problem of inferring interaction kernels from observations of agent-based dynamical systems given observations of trajectories, in particular for collective dynamical systems exhibiting emergent behaviors with complicated interaction kernels, in a nonparametric fashion, and for kernels which are parametrized by a single unknown parameter. We extend the estimators introduced in \cite{PNASLU}, which are based on suitably regularized least squares estimators, to these larger classes of systems. We provide extensive numerical evidence that the estimators provide faithful approximations to the interaction kernels, and provide accurate predictions for trajectories started at new initial conditions, both throughout the "training" time interval in which the observations were made, and often much beyond. We demonstrate these features on prototypical systems displaying collective behaviors, ranging from opinion dynamics, flocking dynamics, self-propelling particle dynamics, synchronized oscillator dynamics, and a gravitational system. Our experiments also suggest that our estimated systems can display the same emergent behaviors of the observed systems, that occur at larger timescales than those used in the training data.
Finally, in the case of families of systems governed by a parameterized family of interaction kernels, we introduce novel estimators that estimate the parameterized family of kernels, splitting it into a common interaction kernel and the action of parameters. We demonstrate this in the case of gravity, by learning both the "common component" $1/r^2$ and the dependency on mass, without any a priori knowledge of either one, from observations of planetary motions in our solar system.

Submitted 30 March, 2020; v1 submitted 23 December, 2019; originally announced December 2019.

arXiv:1910.01578 [pdf, other]

GDP: Generalized Device Placement for Dataflow Graphs

Authors: Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi, Daniel Wong, Peter C. Ma, Qiumin Xu, Ming Zhong, Hanxiao Liu, Anna Goldie, Azalia Mirhoseini, James Laudon

Abstract: Runtime and scalability of large neural networks can be significantly affected by the placement of operations in their dataflow graphs on suitable devices. With increasingly complex neural network architectures and heterogeneous device characteristics, finding a reasonable placement is extremely challenging even for domain experts. Most existing automated device placement approaches are impractical due to the significant amount of compute required and their inability to generalize to new, previously held-out graphs. To address both limitations, we propose an efficient end-to-end method based on a scalable sequential attention mechanism over a graph neural network that is transferable to new graphs. On a diverse set of representative deep learning models, including Inception-v3, AmoebaNet, Transformer-XL, and WaveNet, our method on average achieves 16% improvement over human experts and 9.2% improvement over the prior art with 15 times faster convergence. To further reduce the computation cost, we pre-train the policy network on a set of dataflow graphs and use a superposition network to fine-tune it on each individual graph, achieving state-of-the-art performance on large hold-out graphs with over 50k nodes, such as an 8-layer GNMT.

Submitted 28 September, 2019; originally announced October 2019.

arXiv:1909.12301 [pdf, other]

doi 10.1145/3357384.3357892

DBRec: Dual-Bridging Recommendation via Discovering Latent Groups

Authors: Jingwei Ma, Jiahui Wen, Mingyang Zhong, Liangchen Liu, Chaojie Li, Weitong Chen, Yin Yang, Honghui Tu, Xue Li

Abstract: In recommender systems, the user-item interaction data is usually sparse and not sufficient for learning comprehensive user/item representations for recommendation. To address this problem, we propose a novel dual-bridging recommendation model (DBRec). DBRec performs latent user/item group discovery simultaneously with collaborative filtering, and interacts group information with users/items for bridging similar users/items. Therefore, a user's preference over an unobserved item, in DBRec, can be bridged by the users within the same group who have rated the item, or the user-rated items that share the same group with the unobserved item. In addition, we propose to jointly learn user-user group (item-item group) hierarchies, so that we can effectively discover latent groups and learn compact user/item representations. We jointly integrate collaborative filtering, latent group discovering and hierarchical modelling into a unified framework, so that all the model parameters can be learned toward the optimization of the objective function. We validate the effectiveness of the proposed model with two real datasets, and demonstrate its advantage over the state-of-the-art recommendation models with extensive experiments.

Submitted 16 October, 2019; v1 submitted 26 September, 2019; originally announced September 2019.

Comments: 10 pages, 16 figures, The 28th ACM International Conference on Information and Knowledge Management (CIKM '19)

arXiv:1907.04710 [pdf, other]

Trust-Region Variational Inference with Gaussian Mixture Models

Authors: Oleg Arenz, Mingjun Zhong, Gerhard Neumann

Abstract: Many methods for machine learning rely on approximate inference from intractable probability distributions. Variational inference approximates such distributions by tractable models that can be subsequently used for approximate inference. Learning sufficiently accurate approximations requires a rich model family and careful exploration of the relevant modes of the target distribution. We propose a method for learning accurate GMM approximations of intractable probability distributions based on insights from policy search by using information-geometric trust regions for principled exploration. For efficient improvement of the GMM approximation, we derive a lower bound on the corresponding optimization objective enabling us to update the components independently. Our use of the lower bound ensures convergence to a stationary point of the original objective. The number of components is adapted online by adding new components in promising regions and by deleting components with negligible weight. We demonstrate on several domains that we can learn approximations of complex, multimodal distributions with a quality that is unmet by previous variational inference methods, and that the GMM approximation can be used for drawing samples that are on par with samples created by state-of-the-art MCMC samplers while requiring up to three orders of magnitude less computational resources.

Submitted 4 August, 2020; v1 submitted 10 July, 2019; originally announced July 2019.

Journal ref: Journal of Machine Learning Research. 21(163):1-60, 2020

arXiv:1902.08835 [pdf, other]

Transfer Learning for Non-Intrusive Load Monitoring

Authors: Michele DIncecco, Stefano Squartini, Mingjun Zhong

Abstract: Non-intrusive load monitoring (NILM) is a technique to recover source appliances from only the recorded mains in a household. NILM is unidentifiable and thus a challenging problem, because the inferred power value of an appliance given only the mains may not be unique. To mitigate the unidentifiability problem, various methods incorporating domain knowledge into NILM have been proposed and shown to be effective experimentally. Recently, among these methods, deep neural networks have been shown to perform best. Arguably, the recently proposed sequence-to-point (seq2point) learning is promising for NILM. However, the experiments were only carried out within the same data domain. It is not clear whether the method could be generalised or transferred to different domains, e.g., when the test data are drawn from a different country compared to the training data. We address this issue in the paper, and two transfer learning schemes are proposed, i.e., appliance transfer learning (ATL) and cross-domain transfer learning (CTL). For ATL, our results show that the latent features learnt by a 'complex' appliance, e.g., washing machine, can be transferred to a 'simple' appliance, e.g., kettle. For CTL, our conclusion is that the seq2point learning is transferable. Precisely, when the training and test data are in a similar domain, seq2point learning can be directly applied to the test data without fine tuning; when the training and test data are in different domains, seq2point learning needs fine tuning before applying to the test data.
Interestingly, we show that only the fully connected layers need fine tuning for transfer learning. Source code can be found at https://github.com/MingjunZhong/transferNILM.

Submitted 13 September, 2019; v1 submitted 23 February, 2019; originally announced February 2019.

Comments: 10 pages, 12 Figures

Journal ref: IEEE Transactions on Smart Grid, 2019

arXiv:1812.06003 [pdf, other]

doi 10.1073/pnas.1822012116

Nonparametric inference of interaction laws in systems of agents from trajectory data

Authors: Fei Lu, Mauro Maggioni, Sui Tang, Ming Zhong

Abstract: Inferring the laws of interaction between particles and agents in complex dynamical systems from observational data is a fundamental challenge in a wide variety of disciplines. We propose a non-parametric statistical learning approach to estimate the governing laws of distance-based interactions, with no reference or assumption about their analytical form, from data consisting of trajectories of interacting agents. We demonstrate the effectiveness of our learning approach both by providing theoretical guarantees, and by testing the approach on a variety of prototypical systems in various disciplines. These systems include homogeneous and heterogeneous agents systems, ranging from particle systems in fundamental physics to agent-based systems modeling opinion dynamics under the social influence, prey-predator dynamics, flocking and swarming, and phototaxis in cell dynamics.

Submitted 23 March, 2019; v1 submitted 14 December, 2018; originally announced December 2018.

arXiv:1810.09253 [pdf]

doi 10.23919/CIC.2016.7868812

Classification of normal/abnormal heart sound recordings based on multi-domain features and back propagation neural network

Authors: Hong Tang, Huaming Chen, Ting Li, Mingjun Zhong

Abstract: This paper aims to classify a single PCG recording as normal or abnormal for computer-aided diagnosis. The proposed framework for this challenge has four steps: preprocessing, feature extraction, training and validation. In the preprocessing step, a recording is segmented into four states, i.e., the first heart sound, systolic interval, the second heart sound, and diastolic interval by the Springer Segmentation algorithm. In the feature extraction step, the authors extract 324 features from multi-domains to perform classification. A back propagation neural network is used as the prediction model. The optimal threshold for distinguishing normal and abnormal is determined by the statistics of model output for both normal and abnormal. The performance of the proposed predictor tested by the six training sets is sensitivity 0.812 and specificity 0.860 (overall accuracy is 0.836). However, the performance reduces to sensitivity 0.807 and specificity 0.829 (overall accuracy is 0.818) for the hidden test set.

Submitted 17 October, 2018; originally announced October 2018.

Comments: 4 pages

Journal ref: 2016 Computing in Cardiology Conference (CinC), IEEE, Vancouver, BC, 2016, pp. 593-596
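A quick arithmetic check on the figures quoted in the abstract above: both reported overall accuracy values match the unweighted mean of sensitivity and specificity. The sketch below assumes that definition, since the abstract does not state the formula, and the function name `mean_accuracy` is illustrative only.

```python
def mean_accuracy(sensitivity: float, specificity: float) -> float:
    """Unweighted mean of sensitivity and specificity (assumed definition)."""
    return (sensitivity + specificity) / 2.0


# Figures quoted in the abstract.
train_score = mean_accuracy(0.812, 0.860)   # six training sets
hidden_score = mean_accuracy(0.807, 0.829)  # hidden test set

print(round(train_score, 3))   # compare with the reported 0.836
print(round(hidden_score, 3))  # compare with the reported 0.818
```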

arXiv:1806.00159 [pdf, other]

Neural Control Variates for Variance Reduction

Authors: Ruosi Wan, Mingjun Zhong, Haoyi Xiong, Zhanxing Zhu

Abstract: In statistics and machine learning, approximation of an intractable integration is often achieved by using the unbiased Monte Carlo estimator, but the variances of the estimation are generally high in many applications. Control variates approaches are well-known to reduce the variance of the estimation. These control variates are typically constructed by employing predefined parametric functions or polynomials, determined by using those samples drawn from the relevant distributions. Instead, we propose to construct those control variates by learning neural networks to handle the cases when test functions are complex. In many applications, obtaining a large number of samples for Monte Carlo estimation is expensive, which may result in overfitting when training a neural network. We thus further propose to employ auxiliary random variables induced by the original ones to extend data samples for training the neural networks. We apply the proposed control variates with augmented variables to thermodynamic integration and reinforcement learning. Experimental results demonstrate that our method can achieve significant variance reduction compared with other alternatives.

Submitted 15 October, 2019; v1 submitted 31 May, 2018; originally announced June 2018.

Comments: Published as a conference paper at ECML PKDD 2019
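For readers unfamiliar with control variates, the following is a minimal sketch of the classical (non-neural) technique that the abstract above takes as its starting point. It is not the authors' method. The integrand exp(U), the choice of U itself as the control variate, the sample sizes, and the function name `estimates` are all assumptions made for illustration.

```python
import math
import random


def estimates(n: int, rng: random.Random) -> tuple[float, float]:
    """Return (plain Monte Carlo, control-variate) estimates of E[exp(U)], U ~ Uniform(0, 1)."""
    u = [rng.random() for _ in range(n)]
    f = [math.exp(x) for x in u]
    f_bar = sum(f) / n
    u_bar = sum(u) / n
    # Coefficient beta = Cov(f, U) / Var(U), estimated from the same samples
    # (this introduces a small bias that shrinks as n grows).
    cov = sum((fi - f_bar) * (ui - u_bar) for fi, ui in zip(f, u)) / (n - 1)
    var_u = sum((ui - u_bar) ** 2 for ui in u) / (n - 1)
    beta = cov / var_u
    # The control variate is U itself, whose mean 0.5 is known exactly.
    cv = f_bar - beta * (u_bar - 0.5)
    return f_bar, cv


rng = random.Random(0)
runs = [estimates(200, rng) for _ in range(500)]
truth = math.e - 1.0  # exact value of the integral of exp(u) over [0, 1]
mse_plain = sum((p - truth) ** 2 for p, _ in runs) / len(runs)
mse_cv = sum((c - truth) ** 2 for _, c in runs) / len(runs)
print(mse_plain, mse_cv)
```

Because exp(U) and U are strongly correlated on [0, 1], the corrected estimator should show a much smaller mean squared error than the plain one. The abstract's proposal replaces the fixed function U with a learned neural network.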

arXiv:1612.09106 [pdf, other]

Sequence-to-point learning with neural networks for nonintrusive load monitoring

Authors: Chaoyun Zhang, Mingjun Zhong, Zongzuo Wang, Nigel Goddard, Charles Sutton

Abstract: Energy disaggregation (a.k.a nonintrusive load monitoring, NILM), a single-channel blind source separation problem, aims to decompose the mains which records the whole house electricity consumption into appliance-wise readings. This problem is difficult because it is inherently unidentifiable. Recent approaches have shown that the identifiability problem could be reduced by introducing domain knowledge into the model. Deep neural networks have been shown to be a promising approach for these problems, but sliding windows are necessary to handle the long sequences which arise in signal processing problems, which raises issues about how to combine predictions from different sliding windows. In this paper, we propose sequence-to-point learning, where the input is a window of the mains and the output is a single point of the target appliance. We use convolutional neural networks to train the model. Interestingly, we systematically show that the convolutional neural networks can inherently learn the signatures of the target appliances, which are automatically added into the model to reduce the identifiability problem. We applied the proposed neural network approaches to real-world household energy data, and show that the methods achieve state-of-the-art performance, improving two standard error measures by 84% and 92%.

Submitted 18 September, 2017; v1 submitted 29 December, 2016; originally announced December 2016.

Comments: 8 pages, 3 figures

Journal ref: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018
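The sequence-to-point framing in the abstract above (a window of mains readings in, a single appliance reading out) can be sketched as a data-preparation step. This is an illustrative sketch only: it assumes the target is the appliance reading at the window midpoint, which the abstract does not specify, and the function name `seq2point_pairs` and the toy readings are hypothetical.

```python
def seq2point_pairs(mains, appliance, window):
    """Pair each length-`window` slice of mains with one appliance reading.

    Assumption: the target is the appliance reading at the window midpoint.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that a midpoint exists")
    if len(mains) != len(appliance):
        raise ValueError("mains and appliance must have equal length")
    half = window // 2
    pairs = []
    for start in range(len(mains) - window + 1):
        x = mains[start:start + window]   # input: window of the mains
        y = appliance[start + half]       # output: single point of the appliance
        pairs.append((x, y))
    return pairs


# Toy readings, invented for illustration.
mains = [10, 12, 50, 55, 52, 11, 10]
kettle = [0, 0, 40, 45, 42, 0, 0]
pairs = seq2point_pairs(mains, kettle, window=3)
for x, y in pairs:
    print(x, "->", y)
```

Since every window yields exactly one prediction, there are no overlapping window outputs to reconcile, which is the combination issue the abstract raises.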

arXiv:1510.09130 [pdf, other]

Latent Bayesian melding for integrating individual and population models

Authors: Mingjun Zhong, Nigel Goddard, Charles Sutton

Abstract: In many statistical problems, a more coarse-grained model may be suitable for population-level behaviour, whereas a more detailed model is appropriate for accurate modelling of individual behaviour. This raises the question of how to integrate both types of models. Methods such as posterior regularization follow the idea of generalized moment matching, in that they allow matching expectations between two models, but sometimes both models are most conveniently expressed as latent variable models. We propose latent Bayesian melding, which is motivated by averaging the distributions over population statistics of both the individual-level and the population-level models under a logarithmic opinion pool framework. In a case study on electricity disaggregation, which is a type of single-channel blind source separation problem, we show that latent Bayesian melding leads to significantly more accurate predictions than an approach based solely on generalized moment matching.

Submitted 30 October, 2015; originally announced October 2015.

Comments: 11 pages, Advances in Neural Information Processing Systems (NIPS), 2015. (Spotlight Presentation)

arXiv:1502.00231 [pdf, ps, other]

Feature Selection with Redundancy-complementariness Dispersion

Authors: Zhijun Chen, Chaozhong Wu, Yishi Zhang, Zhen Huang, Bin Ran, Ming Zhong, Nengchao Lyu

Abstract: Feature selection has attracted significant attention in data mining and machine learning in the past decades. Many existing feature selection methods eliminate redundancy by measuring pairwise inter-correlation of features, whereas the complementariness of features and higher inter-correlation among more than two features are ignored. In this study, a modification item concerning the complementariness of features is introduced in the evaluation criterion of features. Additionally, in order to identify the interference effect of already-selected False Positives (FPs), the redundancy-complementariness dispersion is also taken into account to adjust the measurement of pairwise inter-correlation of features. To illustrate the effectiveness of proposed method, classification experiments are applied with four frequently used classifiers on ten datasets. Classification results verify the superiority of proposed method compared with five representative feature selection methods.

Submitted 1 February, 2015; originally announced February 2015.

Comments: 28 pages, 13 figures, 7 tables

MSC Class: 68T10; 94A17; 62B10; 68U35 ACM Class: I.5.2; H.1.1

arXiv:1406.7665 [pdf, other]

Interleaved Factorial Non-Homogeneous Hidden Markov Models for Energy Disaggregation

Authors: Mingjun Zhong, Nigel Goddard, Charles Sutton

Abstract: To reduce energy demand in households it is useful to know which electrical appliances are in use at what times. Monitoring individual appliances is costly and intrusive, whereas data on overall household electricity use is more easily obtained. In this paper, we consider the energy disaggregation problem where a household's electricity consumption is disaggregated into the component appliances. The factorial hidden Markov model (FHMM) is a natural model to fit this data. We enhance this generic model by introducing two constraints on the state sequence of the FHMM. The first is to use a non-homogeneous Markov chain, modelling how appliance usage varies over the day, and the other is to enforce that at most one chain changes state at each time step. This yields a new model which we call the interleaved factorial non-homogeneous hidden Markov model (IFNHMM). We evaluated the ability of this model to perform disaggregation in an ultra-low frequency setting, over a data set of 251 English households. In this new setting, the IFNHMM outperforms the FHMM in terms of recovering the energy used by the component appliances, because stronger constraints have been imposed on the states of the hidden Markov chains. Interestingly, we find that the variability in model performance across households is significant, underscoring the importance of using larger scale data in the disaggregation problem.

Submitted 30 June, 2014; originally announced June 2014.

Comments: 5 pages, 1 figure, conference, The NIPS workshop on Machine Learning for Sustainability, Lake Tahoe, NV, USA, 2013

arXiv:1210.3456 [pdf, other]

Bayesian Analysis for miRNA and mRNA Interactions Using Expression Data

Authors: Mingjun Zhong, Rong Liu, Bo Liu

Abstract: MicroRNAs (miRNAs) are small RNA molecules composed of 19-22 nt, which play important regulatory roles in post-transcriptional gene regulation by inhibiting the translation of the mRNA into proteins or otherwise cleaving the target mRNA. Inferring miRNA targets provides useful information for understanding the roles of miRNA in biological processes that are potentially involved in complex diseases. Statistical methodologies for point estimation, such as the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm, have been proposed to identify the interactions of miRNA and mRNA based on sequence and expression data. In this paper, we propose using the Bayesian LASSO (BLASSO) and the non-negative Bayesian LASSO (nBLASSO) to analyse the interactions between miRNA and mRNA using expression data. The proposed Bayesian methods explore the posterior distributions for those parameters required to model the miRNA-mRNA interactions. These approaches can be used to observe the inferred effects of the miRNAs on the targets by plotting the posterior distributions of those parameters. For comparison purposes, the Least Squares Regression (LSR), Ridge Regression (RR), LASSO, non-negative LASSO (nLASSO), and the proposed Bayesian approaches were applied to four public datasets. We concluded that nLASSO and nBLASSO perform best in terms of sensitivity and specificity. Compared to the point estimate algorithms, which only provide single estimates for those parameters, the Bayesian methods are more meaningful and provide credible intervals, which take into account the uncertainty of the inferred interactions of the miRNA and mRNA. Furthermore, Bayesian methods naturally provide statistical significance to select convincing inferred interactions, while point estimate algorithms require a manually chosen threshold, which is less meaningful, to choose the possible interactions.

Submitted 30 June, 2014; v1 submitted 12 October, 2012; originally announced October 2012.

Comments: 21 pages, 11 figures, 8 tables

arXiv:1206.4666 [pdf]

A Bayesian Approach to Approximate Joint Diagonalization of Square Matrices

Authors: Mingjun Zhong, Mark Girolami

Abstract: We present a Bayesian scheme for the approximate diagonalisation of several square matrices which are not necessarily symmetric. A Gibbs sampler is derived to simulate samples of the common eigenvectors and the eigenvalues for these matrices. Several synthetic examples are used to illustrate the performance of the proposed Gibbs sampler, and we then provide comparisons to several other joint diagonalization algorithms, which show that the Gibbs sampler achieves state-of-the-art performance on the examples considered. As a byproduct, the output of the Gibbs sampler could be used to estimate the log marginal likelihood; however, we employ the approximation based on the Bayesian information criterion (BIC), which in the synthetic examples considered correctly located the number of common eigenvectors. We then successfully applied the sampler to the source separation problem as well as the common principal component analysis and the common spatial pattern analysis problems.

Submitted 18 June, 2012; originally announced June 2012.

Comments: ICML2012

Showing 1–18 of 18 results for author: Zhong, M