-
Goal-Oriented Bayesian Optimal Experimental Design for Nonlinear Models using Markov Chain Monte Carlo
Authors:
Shijie Zhong,
Wanggang Shen,
Tommie Catanach,
Xun Huan
Abstract:
Optimal experimental design (OED) provides a systematic approach to quantify and maximize the value of experimental data. Under a Bayesian approach, conventional OED maximizes the expected information gain (EIG) on model parameters. However, we are often interested in not the parameters themselves, but predictive quantities of interest (QoIs) that depend on the parameters in a nonlinear manner. We present a computational framework of predictive goal-oriented OED (GO-OED) suitable for nonlinear observation and prediction models, which seeks the experimental design providing the greatest EIG on the QoIs. In particular, we propose a nested Monte Carlo estimator for the QoI EIG, featuring Markov chain Monte Carlo for posterior sampling and kernel density estimation for evaluating the posterior-predictive density and its Kullback-Leibler divergence from the prior-predictive. The GO-OED design is then found by maximizing the EIG over the design space using Bayesian optimization. We demonstrate the effectiveness of the overall nonlinear GO-OED method, and illustrate its differences versus conventional non-GO-OED, through various test problems and an application of sensor placement for source inversion in a convection-diffusion field.
Submitted 1 February, 2025; v1 submitted 26 March, 2024;
originally announced March 2024.
-
Bridging Domains with Approximately Shared Features
Authors:
Ziliang Samuel Zhong,
Xiang Pan,
Qi Lei
Abstract:
Multi-source domain adaptation aims to reduce performance degradation when applying machine learning models to unseen domains. A fundamental challenge is devising the optimal strategy for feature selection. Existing literature is somewhat paradoxical: some advocate for learning invariant features from source domains, while others favor more diverse features. To address the challenge, we propose a statistical framework that distinguishes the utilities of features based on the variance of their correlation to label $y$ across domains. Under our framework, we design and analyze a learning procedure consisting of learning an approximately shared feature representation from source tasks and fine-tuning it on the target task. Our theoretical analysis establishes the importance of learning approximately shared features instead of only the strictly invariant features and yields an improved population risk compared to previous results on both source and target tasks, thus partly resolving the paradox mentioned above. Inspired by our theory, we propose a more practical way to isolate the content (invariant+approximately shared) from environmental features, which further consolidates our theoretical findings.
Submitted 11 March, 2024;
originally announced March 2024.
-
Improved theoretical guarantee for rank aggregation via spectral method
Authors:
Ziliang Samuel Zhong,
Shuyang Ling
Abstract:
Given pairwise comparisons between multiple items, how should we rank them so that the ranking matches the observations? This problem, known as rank aggregation, has found many applications in sports, recommendation systems, and other web applications. As it is generally NP-hard to find a global ranking that minimizes the mismatch (known as the Kemeny optimization), we focus on the Erdős-Rényi outliers (ERO) model for this ranking problem. Here, each pairwise comparison is a corrupted copy of the true score difference. We investigate spectral ranking algorithms that are based on unnormalized and normalized data matrices. The key is to understand their performance in recovering the underlying scores of each item from the observed data. This reduces to deriving an entry-wise perturbation error bound between the top eigenvectors of the unnormalized/normalized data matrix and their population counterparts. By using the leave-one-out technique, we provide a sharper $\ell_{\infty}$-norm perturbation bound of the eigenvectors and also derive an error bound on the maximum displacement for each item, with only $\Omega(n\log n)$ samples. Our theoretical analysis improves upon the state-of-the-art results in terms of sample complexity, and our numerical experiments confirm these theoretical findings.
Submitted 10 September, 2023; v1 submitted 7 September, 2023;
originally announced September 2023.
-
Enrollment Forecast for Clinical Trials at the Planning Phase with Study-Level Historical Data
Authors:
Mengjia Yu,
Sheng Zhong,
Yunzhao Xing,
Li Wang
Abstract:
Given progressive developments and demands on clinical trials, accurate enrollment timeline forecasting is increasingly crucial for both strategic decision-making and trial execution excellence. The naive approach assumes flat enrollment rates using the average of historical data, while the traditional statistical approach applies a simple Poisson-Gamma model with time-invariant rates for site activation and subject recruitment. Both lack nontrivial factors such as time and location. We propose a novel two-segment statistical approach based on Quasi-Poisson regression for the subject accrual rate and a Poisson process for subject enrollment and site activation. The input study-level data is publicly accessible and can be integrated with historical study data from the user's organization to prospectively predict the enrollment timeline. The new framework is neat and accurate compared with preceding work. We validate the performance of our proposed enrollment model and compare the results with other frameworks on 7 curated studies.
Submitted 14 March, 2023;
originally announced March 2023.
-
Enrollment Forecast for Clinical Trials at the Portfolio Planning Phase Based on Site-Level Historical Data
Authors:
Sheng Zhong,
Yunzhao Xing,
Mengjia Yu,
Li Wang
Abstract:
Accurate forecasting of a clinical trial enrollment timeline at the planning phase is of great importance to both corporate strategic planning and trial operational excellence. While predictions of key milestones such as the last subject first dose date can inform strategic decision-making, detailed predictive insights (e.g., median number of enrolled subjects by month for a country) can facilitate the planning of clinical trial operation activities and promote execution excellence. The naive approach often calculates an average enrollment rate from historical data and generates an inaccurate prediction based on a linear trend with the average rate. The traditional statistical approach utilizes the simple Poisson-Gamma model, which assumes time-invariant site activation rates and can fail to capture the underlying nonlinear patterns (e.g., an up-and-down site activation pattern). We present a novel statistical approach based on generalized linear mixed-effects models and non-homogeneous Poisson processes within a Bayesian framework to model country initiation, site activation and subject enrollment sequentially in a systematic fashion. We validate the performance of our proposed enrollment modeling framework on a set of 25 preselected studies from four therapeutic areas. Our modeling framework shows a substantial improvement in prediction accuracy in comparison to the traditional statistical approach. Furthermore, we show that our modeling and simulation approach calibrates the data variability appropriately and gives correct coverage rates for prediction intervals of various nominal levels. Finally, we demonstrate the use of our approach to generate the predicted enrollment curves through time with confidence bands overlaid.
Submitted 3 January, 2023;
originally announced January 2023.
-
S&P 500 Stock Price Prediction Using Technical, Fundamental and Text Data
Authors:
Shan Zhong,
David B. Hitchcock
Abstract:
We summarized both common and novel predictive models used for stock price prediction and combined them with technical indices, fundamental characteristics and text-based sentiment data to predict S&P stock prices. A 66.18% accuracy in S&P 500 index directional prediction and a 62.09% accuracy in individual stock directional prediction were achieved by combining different machine learning models, such as Random Forest and LSTM, into state-of-the-art ensemble models. The data we use contain weekly historical prices, finance reports, and text information from news items associated with 518 different common stocks issued by current and former S&P 500 large-cap companies, from January 1, 2000 to December 31, 2019. Our study's innovations include utilizing deep language models to categorize and infer financial news item sentiment; fusing different models containing different combinations of variables and stocks to jointly make predictions; and overcoming the insufficient data problem for machine learning models in time series by using data across different stocks.
Submitted 22 September, 2021; v1 submitted 24 August, 2021;
originally announced August 2021.
-
Graph-to-Tree Neural Networks for Learning Structured Input-Output Translation with Applications to Semantic Parsing and Math Word Problem
Authors:
Shucheng Li,
Lingfei Wu,
Shiwei Feng,
Fangli Xu,
Fengyuan Xu,
Sheng Zhong
Abstract:
The celebrated Seq2Seq technique and its numerous variants achieve excellent performance on many tasks such as neural machine translation, semantic parsing, and math word problem solving. However, these models either only consider input objects as sequences while ignoring the important structural information for encoding, or they simply treat output objects as sequence outputs instead of structural objects for decoding. In this paper, we present a novel Graph-to-Tree Neural Network, Graph2Tree, consisting of a graph encoder and a hierarchical tree decoder, which encodes an augmented graph-structured input and decodes a tree-structured output. In particular, we investigate our model on two problems: neural semantic parsing and math word problem solving. Our extensive experiments demonstrate that our Graph2Tree model outperforms or matches the performance of other state-of-the-art models on these tasks.
Submitted 6 October, 2020; v1 submitted 7 April, 2020;
originally announced April 2020.
-
Practical Deep Reinforcement Learning Approach for Stock Trading
Authors:
Xiao-Yang Liu,
Zhuoran Xiong,
Shan Zhong,
Hongyang Yang,
Anwar Walid
Abstract:
Stock trading strategies play a crucial role in investment companies. However, it is challenging to obtain an optimal strategy in the complex and dynamic stock market. We explore the potential of deep reinforcement learning to optimize stock trading strategy and thus maximize investment return. 30 stocks are selected as our trading stocks and their daily prices are used as the training and trading market environment. We train a deep reinforcement learning agent and obtain an adaptive trading strategy. The agent's performance is evaluated and compared with the Dow Jones Industrial Average and the traditional min-variance portfolio allocation strategy. The proposed deep reinforcement learning approach is shown to outperform the two baselines in terms of both the Sharpe ratio and cumulative returns.
Submitted 30 July, 2022; v1 submitted 19 November, 2018;
originally announced November 2018.
-
Learning Embeddings of Directed Networks with Text-Associated Nodes---with Applications in Software Package Dependency Networks
Authors:
Kexuan Sun,
Shudan Zhong,
Hong Xu
Abstract:
A network embedding consists of a vector representation for each node in the network. Its usefulness has been shown in many real-world application domains, such as social networks and web networks. Directed networks with text associated with each node, such as software package dependency networks, are commonplace. However, to the best of our knowledge, their embeddings have hitherto not been specifically studied. In this paper, we propose PCTADW-1 and PCTADW-2, two algorithms based on neural networks that learn embeddings of directed networks with text associated with each node. We create two new node-labeled networks of this kind: the package dependency networks in two popular GNU/Linux distributions, Debian and Fedora. We experimentally demonstrate that the embeddings produced by our algorithms result in node classification of better quality than those of various baselines on these two networks. We observe a systematic presence of analogies (similar to those in word embeddings) in the network embeddings of software package dependency networks. To the best of our knowledge, this is the first time that such a systematic presence of analogies has been observed in network and document embeddings. We further demonstrate that these network embeddings can be used in a novel way to better understand software attributes, such as the development process and user interface of software.
Submitted 26 November, 2020; v1 submitted 6 September, 2018;
originally announced September 2018.
-
Beating the bookies with their own numbers - and how the online sports betting market is rigged
Authors:
Lisandro Kaunitz,
Shenjun Zhong,
Javier Kreiner
Abstract:
The online sports gambling industry employs teams of data analysts to build forecast models that turn the odds at sports games in their favour. While several betting strategies have been proposed to beat bookmakers, from expert prediction models and arbitrage strategies to odds bias exploitation, their returns have been inconsistent and it remains to be shown that a betting strategy can outperform the online sports betting market. We designed a strategy to beat football bookmakers with their own numbers. Instead of building a forecasting model to compete with bookmakers' predictions, we exploited the probability information implicit in the odds publicly available in the marketplace to find bets with mispriced odds. Our strategy proved profitable in a 10-year historical simulation using closing odds, a 6-month historical simulation using minute-to-minute odds, and a 5-month period during which we staked real money with the bookmakers (we made code, data and models publicly available). Our results demonstrate that the football betting market is inefficient: bookmakers can be consistently beaten across thousands of games in both simulated environments and real-life betting. We provide a detailed description of our betting experience to illustrate how the sports gambling industry compensates for these market inefficiencies with discriminatory practices against successful clients.
Submitted 10 November, 2017; v1 submitted 8 October, 2017;
originally announced October 2017.