-
Scaling Probabilistic Circuits via Monarch Matrices
Authors:
Honghua Zhang,
Meihua Dang,
Benjie Wang,
Stefano Ermon,
Nanyun Peng,
Guy Van den Broeck
Abstract:
Probabilistic Circuits (PCs) are tractable representations of probability distributions allowing for exact and efficient computation of likelihoods and marginals. Recent advancements have improved the scalability of PCs either by leveraging their sparse properties or through the use of tensorized operations for better hardware utilization. However, no existing method fully exploits both aspects si…
▽ More
Probabilistic Circuits (PCs) are tractable representations of probability distributions allowing for exact and efficient computation of likelihoods and marginals. Recent advancements have improved the scalability of PCs either by leveraging their sparse properties or through the use of tensorized operations for better hardware utilization. However, no existing method fully exploits both aspects simultaneously. In this paper, we propose a novel sparse and structured parameterization for the sum blocks in PCs. By replacing dense matrices with sparse Monarch matrices, we significantly reduce the memory and computation costs, enabling unprecedented scaling of PCs. From a theory perspective, our construction arises naturally from circuit multiplication; from a practical perspective, compared to previous efforts on scaling up tractable probabilistic models, our approach not only achieves state-of-the-art generative modeling performance on challenging benchmarks like Text8, LM1B and ImageNet, but also demonstrates superior scaling behavior, achieving the same performance with substantially less compute as measured by the number of floating-point operations (FLOPs) during training.
△ Less
Submitted 14 June, 2025;
originally announced June 2025.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…
▽ More
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
△ Less
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
The Causal Learning of Retail Delinquency
Authors:
Yiyan Huang,
Cheuk Hang Leung,
Xing Yan,
Qi Wu,
Nanbo Peng,
Dongdong Wang,
Zhixiang Huang
Abstract:
This paper focuses on the expected difference in borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook the confounding effects and hence the estimation error can be magnificent. As such, we propose another approach to construct the estimators such that the error can be greatly reduced. The proposed estimators are shown to be unbiased, consisten…
▽ More
This paper focuses on the expected difference in borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook the confounding effects and hence the estimation error can be magnificent. As such, we propose another approach to construct the estimators such that the error can be greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of estimating the causal quantities between the classical estimators and the proposed estimators. The comparison is tested across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under different simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approaches to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction of estimation error is strikingly substantial if the causal effects are accounted for correctly.
△ Less
Submitted 17 December, 2020;
originally announced December 2020.
-
Understanding Distributional Ambiguity via Non-robust Chance Constraint
Authors:
Qi Wu,
Shumin Ma,
Cheuk Hang Leung,
Wei Liu,
Nanbo Peng
Abstract:
This paper provides a non-robust interpretation of the distributionally robust optimization (DRO) problem by relating the distributional uncertainties to the chance probabilities. Our analysis allows a decision-maker to interpret the size of the ambiguity set, which is often lack of business meaning, through the chance parameters constraining the objective function. We first show that, for general…
▽ More
This paper provides a non-robust interpretation of the distributionally robust optimization (DRO) problem by relating the distributional uncertainties to the chance probabilities. Our analysis allows a decision-maker to interpret the size of the ambiguity set, which is often lack of business meaning, through the chance parameters constraining the objective function. We first show that, for general $φ$-divergences, a DRO problem is asymptotically equivalent to a class of mean-deviation problems. These mean-deviation problems are not subject to uncertain distributions, and the ambiguity radius in the original DRO problem now plays the role of controlling the risk preference of the decision-maker. We then demonstrate that a DRO problem can be cast as a chance-constrained optimization (CCO) problem when a boundedness constraint is added to the decision variables. Without the boundedness constraint, the CCO problem is shown to perform uniformly better than the DRO problem, irrespective of the radius of the ambiguity set, the choice of the divergence measure, or the tail heaviness of the center distribution. Thanks to our high-order expansion result, a notable feature of our analysis is that it applies to divergence measures that accommodate well heavy tail distributions such as the student $t$-distribution and the lognormal distribution, besides the widely-used Kullback-Leibler (KL) divergence, which requires the distribution of the objective function to be exponentially bounded. Using the portfolio selection problem as an example, our comprehensive testings on multivariate heavy-tail datasets, both synthetic and real-world, shows that this business-interpretation approach is indeed useful and insightful.
△ Less
Submitted 21 September, 2020; v1 submitted 3 June, 2019;
originally announced June 2019.