-
Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents
Authors:
Yueying Li,
Jim Dai,
Tianyi Peng
Abstract:
As demand for Large Language Models (LLMs) and AI agents rapidly grows, optimizing systems for efficient LLM inference becomes critical. While significant efforts have focused on system-level engineering, little has been explored from a mathematical modeling and queueing perspective.
In this paper, we aim to develop the queueing fundamentals for large language model (LLM) inference, bridging the gap between the queueing theory and LLM system communities. In particular, we study the throughput aspect in LLM inference systems. We prove that a large class of 'work-conserving' scheduling algorithms can achieve maximum throughput for an individual LLM inference engine, highlighting 'work-conserving' as a key design principle in practice. In a network of LLM agents, work-conserving scheduling alone is insufficient, particularly when facing specific workload structures and multi-class workflows that require more sophisticated scheduling strategies. Evaluations of real-world systems show that Orca and Sarathi-serve are throughput-optimal, reassuring practitioners, while FasterTransformer and vanilla vLLM are not maximally stable and should be used with caution. Our results highlight the substantial benefits that the queueing community can offer in improving LLM inference systems and call for more interdisciplinary development.
Submitted 24 April, 2025; v1 submitted 9 April, 2025;
originally announced April 2025.
-
A Systematic Review of Machine Learning Approaches for Detecting Deceptive Activities on Social Media: Methods, Challenges, and Biases
Authors:
Yunchong Liu,
Xiaorui Shen,
Yeyubei Zhang,
Zhongyan Wang,
Yexin Tian,
Jianglai Dai,
Yuchen Cao
Abstract:
Social media platforms like Twitter, Facebook, and Instagram have facilitated the spread of misinformation, necessitating automated detection systems. This systematic review evaluates 36 studies that apply machine learning (ML) and deep learning (DL) models to detect fake news, spam, and fake accounts on social media. Using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), the review identified key biases across the ML lifecycle: selection bias due to non-representative sampling, inadequate handling of class imbalance, insufficient linguistic preprocessing (e.g., negations), and inconsistent hyperparameter tuning. Although models such as Support Vector Machines (SVM), Random Forests, and Long Short-Term Memory (LSTM) networks showed strong potential, over-reliance on accuracy as an evaluation metric in imbalanced data settings was a common flaw. The review highlights the need for improved data preprocessing (e.g., resampling techniques), consistent hyperparameter tuning, and the use of appropriate metrics like precision, recall, F1 score, and AUROC. Addressing these limitations can lead to more reliable and generalizable ML/DL models for detecting deceptive content, ultimately contributing to the reduction of misinformation on social media.
Submitted 9 March, 2025; v1 submitted 26 October, 2024;
originally announced October 2024.
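The abstract's point about accuracy on imbalanced data can be illustrated with a small sketch. The data below are hypothetical (95 genuine items and 5 deceptive ones, with a classifier that always predicts "genuine"), and the helper name `binary_metrics` is an assumption introduced here, not something taken from the review.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = deceptive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced sample: 95 genuine (0) and 5 deceptive (1) items.
y_true = [0] * 95 + [1] * 5
# A trivial classifier that labels everything as genuine.
y_pred = [0] * 100

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this constructed case the accuracy is high even though no deceptive item is detected, which is why recall and F1 are more informative here.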
-
DMseg: a Python algorithm for de novo detection of differentially or variably methylated regions
Authors:
Xiaoyu Wang,
Ming Yu,
William Grady,
Ziding Feng,
Wei Sun,
James Y Dai
Abstract:
Detecting and assessing statistical significance of differentially methylated regions (DMRs) is a fundamental task in methylome association studies. While the average differential methylation in different phenotype groups has been the inferential focus, methylation changes in chromosomal regions may also present as differential variability, i.e., variably methylated regions (VMRs). Testing statistical significance of regional differential methylation is a challenging problem, and existing algorithms do not provide accurate type I error control for genome-wide DMR or VMR analysis. No algorithm has been publicly available for detecting VMRs. We propose DMseg, a Python algorithm with efficient DMR/VMR detection and significance assessment for array-based methylome data, and compare its performance to Bumphunter, a popular existing algorithm. Operationally, DMseg searches for DMRs or VMRs within CpG clusters that are adaptively determined by both gap distance and correlation between contiguous CpG sites in a microarray. The Levene test is implemented for assessing differential variability of individual CpGs. A likelihood ratio statistic is proposed to test for a constant difference within CpGs in a DMR or VMR to summarize the evidence of regional difference. Using a stratified permutation scheme and pooling null distributions of LRTs from clusters with similar numbers of CpGs, DMseg provides accurate control of the type I error rate. In simulation experiments, DMseg shows greater power than Bumphunter to detect DMRs. Application to methylome data of Barrett's esophagus and esophageal adenocarcinoma reveals a number of DMRs and VMRs of biological interest.
Submitted 26 June, 2023;
originally announced June 2023.
-
Incorporating increased variability in testing for cancer DNA methylation
Authors:
James Y. Dai,
Heng Chen,
Xiaoyu Wang,
Wei Sun,
Ying Huang,
William M. Grady,
Ziding Feng
Abstract:
Cancer development is associated with aberrant DNA methylation, including increased stochastic variability. Statistical tests for discovering cancer methylation biomarkers have focused on changes in mean methylation. To improve the power of detection, we propose to incorporate increased variability in testing for cancer differential methylation by two joint constrained tests: one for differential mean and increased variance, the other for increased mean and increased variance. To improve small sample properties, likelihood ratio statistics are developed, accounting for the variability in estimating the sample medians in the Levene test. Efficient algorithms were developed and implemented in the DMVC function of the R package DMtest. The proposed joint constrained tests were compared to standard tests and partial area under the curve (pAUC) for the receiver operating characteristic curve (ROC) in simulated datasets under diverse models. Application to the high-throughput methylome data in The Cancer Genome Atlas (TCGA) shows substantially increased yield of candidate CpG markers.
Submitted 26 June, 2023;
originally announced June 2023.
-
Clip-OGD: An Experimental Design for Adaptive Neyman Allocation in Sequential Experiments
Authors:
Jessica Dai,
Paula Gradu,
Christopher Harshaw
Abstract:
From clinical development of cancer therapies to investigations into partisan bias, adaptive sequential designs have become an increasingly popular method for causal inference, as they offer the possibility of improved precision over their non-adaptive counterparts. However, even in simple settings (e.g. two treatments) the extent to which adaptive designs can improve precision is not sufficiently well understood. In this work, we study the problem of Adaptive Neyman Allocation in a design-based potential outcomes framework, where the experimenter seeks to construct an adaptive design which is nearly as efficient as the optimal (but infeasible) non-adaptive Neyman design, which has access to all potential outcomes. Motivated by connections to online optimization, we propose Neyman Ratio and Neyman Regret as two (equivalent) performance measures of adaptive designs for this problem. We present Clip-OGD, an adaptive design which achieves $\widetilde{O}(\sqrt{T})$ expected Neyman regret and thereby recovers the optimal Neyman variance in large samples. Finally, we construct a conservative variance estimator which facilitates the development of asymptotically valid confidence intervals. To complement our theoretical results, we conduct simulations using data from a microeconomic experiment.
Submitted 13 October, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
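As background for the Neyman design mentioned in the abstract, the sketch below uses the classical two-arm Neyman allocation, in which the treatment probability is proportional to the outcome standard deviation of each arm. This textbook form is an assumption made for illustration; the paper's design-based definition may differ in detail, and the function names and the standard deviations (3.0 and 1.0) are hypothetical.

```python
def neyman_probability(sd_treated, sd_control):
    """Classical Neyman allocation: share of units assigned to treatment."""
    total = sd_treated + sd_control
    if total == 0:
        # With no variability in either arm, any split is equally good.
        return 0.5
    return sd_treated / total


def scaled_variance(p, sd_treated, sd_control):
    """n times the variance of the difference in means when a share p is treated.

    Assumes the simple form sd_treated^2 / (n p) + sd_control^2 / (n (1 - p)),
    with 0 < p < 1.
    """
    return sd_treated ** 2 / p + sd_control ** 2 / (1 - p)


# Hypothetical arm-level standard deviations.
sd_treated, sd_control = 3.0, 1.0

p_star = neyman_probability(sd_treated, sd_control)
var_neyman = scaled_variance(p_star, sd_treated, sd_control)
var_balanced = scaled_variance(0.5, sd_treated, sd_control)

print(p_star, var_neyman, var_balanced)
```

Under these assumed values, the Neyman split treats more units in the noisier arm and yields a smaller variance than a balanced 50/50 split. The standard deviations are unknown before the experiment, which is why the non-adaptive Neyman design is described as infeasible and an adaptive design is of interest.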
-
Augmented Learning of Heterogeneous Treatment Effects via Gradient Boosting Trees
Authors:
Heng Chen,
Michael L. LeBlanc,
James Y. Dai
Abstract:
Heterogeneous treatment effects (HTE) based on patients' genetic or clinical factors are of significant interest to precision medicine. Simultaneously modeling HTE and corresponding main effects for randomized clinical trials with high-dimensional predictive markers is challenging. Motivated by the modified covariates approach, we propose a two-stage statistical learning procedure for estimating HTE with optimal efficiency augmentation, generalizing to an arbitrary interaction model and exploiting powerful extreme gradient boosting trees (XGBoost). Target estimands for HTE are defined in the scale of mean difference for quantitative outcomes, or risk ratio for binary outcomes, which are the minimizers of specialized loss functions. The first stage is to estimate the main-effect equivalency of the baseline markers on the outcome, which is then used as an augmentation term in the second stage estimation for HTE. The proposed two-stage procedure is robust to model mis-specification of main effects and improves efficiency for estimating HTE through nonparametric function estimation, e.g., XGBoost. A permutation test is proposed for global assessment of evidence for HTE. An analysis of a genetic study in the Prostate Cancer Prevention Trial, led by the SWOG Cancer Research Network, is conducted to showcase the properties and the utilities of the two-stage method.
Submitted 2 February, 2023;
originally announced February 2023.
-
Heavy-Tailed Loss Frequencies from Mixtures of Negative Binomial and Poisson Counts
Authors:
Jiansheng Dai,
Ziheng Huang,
Michael R. Powers,
Jiaxin Xu
Abstract:
Heavy-tailed random variables have been used in insurance research to model both loss frequencies and loss severities, with substantially more emphasis on the latter. In the present work, we take a step toward addressing this imbalance by exploring the class of heavy-tailed frequency models formed by continuous mixtures of Negative Binomial and Poisson random variables. We begin by defining the concept of a calibrative family of mixing distributions (each member of which is identifiable from its associated Negative Binomial mixture), and show how to construct such families from only a single member. We then introduce a new heavy-tailed frequency model -- the two-parameter ZY distribution -- as a generalization of both the one-parameter Zeta and Yule distributions, and construct calibrative families for both the new distribution and the heavy-tailed two-parameter Waring distribution. Finally, we pursue natural extensions of both the ZY and Waring families to a unifying, four-parameter heavy-tailed model, providing the foundation for a novel loss-frequency modeling approach to complement conventional GLM analyses. This approach is illustrated by application to a classic set of Swedish commercial motor-vehicle insurance loss data.
Submitted 10 November, 2022; v1 submitted 7 November, 2022;
originally announced November 2022.
-
Fair Machine Learning Under Partial Compliance
Authors:
Jessica Dai,
Sina Fazelpour,
Zachary C. Lipton
Abstract:
Typically, fair machine learning research focuses on a single decisionmaker and assumes that the underlying population is stationary. However, many of the critical domains motivating this work are characterized by competitive marketplaces with many decisionmakers. Realistically, we might expect only a subset of them to adopt any non-compulsory fairness-conscious policy, a situation that political philosophers call partial compliance. This possibility raises important questions: how does the strategic behavior of decision subjects in partial compliance settings affect the allocation outcomes? If k% of employers were to voluntarily adopt a fairness-promoting intervention, should we expect k% progress (in aggregate) towards the benefits of universal adoption, or will the dynamics of partial compliance wash out the hoped-for benefits? How might adopting a global (versus local) perspective impact the conclusions of an auditor? In this paper, we propose a simple model of an employment market, leveraging simulation as a tool to explore the impact of both interaction effects and incentive effects on outcomes and auditing metrics. Our key findings are that at equilibrium: (1) partial compliance (k% of employers) can result in far less than proportional (k%) progress towards the full compliance outcomes; (2) the gap is more severe when fair employers match global (vs local) statistics; (3) choices of local vs global statistics can paint dramatically different pictures of the performance vis-a-vis fairness desiderata of compliant versus non-compliant employers; and (4) partial compliance to local parity measures can induce extreme segregation.
Submitted 26 September, 2022; v1 submitted 6 November, 2020;
originally announced November 2020.
-
Characterizing the Zeta Distribution via Continuous Mixtures
Authors:
Jiansheng Dai,
Ziheng Huang,
Michael R. Powers,
Jiaxin Xu
Abstract:
We offer two novel characterizations of the Zeta distribution: first, as tractable continuous mixtures of Negative Binomial distributions (with fixed shape parameter, r > 0), and second, as a tractable continuous mixture of Poisson distributions. In both the Negative Binomial case for r >= 1 and the Poisson case, the resulting Zeta distributions are identifiable because each mixture can be associated with a unique mixing distribution. In the Negative Binomial case for 0 < r < 1, the mixing distributions are quasi-distributions (for which the quasi-probability density function assumes some negative values).
Submitted 4 June, 2021; v1 submitted 14 August, 2020;
originally announced August 2020.
-
Mitigating backdoor attacks in LSTM-based Text Classification Systems by Backdoor Keyword Identification
Authors:
Chuanshuai Chen,
Jiazhu Dai
Abstract:
It has been shown that deep neural networks face a new threat called backdoor attacks, in which an adversary injects backdoors into the neural network model by poisoning the training dataset. When the input contains a special pattern called the backdoor trigger, the backdoored model carries out a malicious task specified by the adversary, such as misclassification. In text classification systems, backdoors inserted in the models can cause spam or malicious speech to escape detection. Previous work has mainly focused on defending against backdoor attacks in computer vision; little attention has been paid to defense methods for RNN backdoor attacks on text classification. In this paper, by analyzing the changes in inner LSTM neurons, we propose a defense method called Backdoor Keyword Identification (BKI) to mitigate backdoor attacks that the adversary performs against LSTM-based text classification by data poisoning. This method can identify poisoning samples crafted to insert a backdoor into the model and exclude them from the training data, without requiring a verified and trusted dataset. We evaluate our method on four text classification datasets: IMDB, DBpedia ontology, 20 newsgroups, and Reuters-21578. It achieves good performance on all of them regardless of the trigger sentences.
Submitted 14 March, 2021; v1 submitted 11 July, 2020;
originally announced July 2020.
-
Fast-UAP: An Algorithm for Speeding up Universal Adversarial Perturbation Generation with Orientation of Perturbation Vectors
Authors:
Jiazhu Dai,
Le Shu
Abstract:
Convolutional neural networks (CNNs) have become one of the most popular machine learning tools and are applied in various tasks. However, CNN models are vulnerable to universal perturbations, which are usually human-imperceptible but can cause natural images to be misclassified with high probability. One of the state-of-the-art algorithms for generating universal perturbations is known as UAP. UAP only aggregates the minimal perturbations in every iteration, so the magnitude of the generated universal perturbation grows inefficiently and generation is slow. In this paper, we propose an optimized algorithm that improves the performance of crafting universal perturbations based on the orientation of perturbation vectors. At each iteration, instead of choosing the minimal perturbation vector with respect to each image, we aggregate the current instance of the universal perturbation with the perturbation whose orientation is similar to it, so that the magnitude of the aggregate grows as much as possible at every iteration. The experimental results show that we obtain universal perturbations in a shorter time and with fewer training images. Furthermore, we observe in experiments that universal perturbations generated by our proposed algorithm increase the fooling rate by 9% on average in white-box and black-box attacks compared with universal perturbations generated by UAP.
Submitted 6 January, 2020; v1 submitted 4 November, 2019;
originally announced November 2019.
-
Signal Demodulation with Machine Learning Methods for Physical Layer Visible Light Communications: Prototype Platform, Open Dataset and Algorithms
Authors:
Shuai Ma,
Jiahui Dai,
Songtao Lu,
Hang Li,
Han Zhang,
Chun Du,
Shiyin Li
Abstract:
In this paper, we investigate the design and implementation of machine learning (ML) based demodulation methods in the physical layer of visible light communication (VLC) systems. We build a flexible hardware prototype of an end-to-end VLC system, from which the received signals are collected as the real data. The dataset, which contains eight types of modulated signals, is available online. Then, we propose three ML demodulators based on convolutional neural network (CNN), deep belief network (DBN), and adaptive boosting (AdaBoost), respectively. Specifically, the CNN based demodulator converts the modulated signals to images and recognizes the signals by image classification. The proposed DBN based demodulator contains three restricted Boltzmann machines (RBMs) to extract the modulation features. The AdaBoost method includes a strong classifier that is constructed from weak classifiers based on the k-nearest neighbor (KNN) algorithm. These three demodulators are trained and tested on our online open dataset. Experimental results show that the demodulation accuracy of the three data-driven demodulators drops as the transmission distance increases. A higher modulation order negatively influences the accuracy for a given transmission distance. Among the three ML methods, the AdaBoost demodulator achieves the best performance.
Submitted 13 March, 2019;
originally announced March 2019.
-
Structures and Assumptions: Strategies to Harness Gene $\times$ Gene and Gene $\times$ Environment Interactions in GWAS
Authors:
Charles Kooperberg,
Michael LeBlanc,
James Y. Dai,
Indika Rajapakse
Abstract:
Genome-wide association studies, in which as many as a million single nucleotide polymorphisms (SNPs) are measured on several thousand samples, are quickly becoming a common type of study for identifying genetic factors associated with many phenotypes. There is a strong assumption that interactions between SNPs or genes and interactions between genes and environmental factors substantially contribute to the genetic risk of a disease. Identification of such interactions could potentially lead to increased understanding about disease mechanisms; drug $\times$ gene interactions could have profound applications for personalized medicine; strong interaction effects could be beneficial for risk prediction models. In this paper we provide an overview of different approaches to model interactions, emphasizing approaches that make specific use of the structure of genetic data, and those that make specific modeling assumptions that may (or may not) be reasonable to make. We conclude that to identify interactions it is often necessary to do some selection of SNPs, for example, based on prior hypotheses or marginal significance, but that to identify SNPs that are marginally associated with a disease it may also be useful to consider larger numbers of interactions.
Submitted 22 October, 2010;
originally announced October 2010.