Seeded Poisson Factorization: Leveraging domain knowledge to fit topic models
Authors:
Bernd Prostmaier,
Jan Vávra,
Bettina Grün,
Paul Hofmarcher
Abstract:
Topic models are widely used for discovering latent thematic structures in large text corpora, yet traditional unsupervised methods often struggle to align with predefined conceptual domains. This paper introduces Seeded Poisson Factorization (SPF), a novel approach that extends the Poisson Factorization framework by incorporating domain knowledge through seed words. SPF enables more interpretable and structured topic discovery by modifying the prior distribution of topic-specific term intensities, assigning higher initial rates to predefined seed words. The model is estimated using variational inference with stochastic gradient optimization, ensuring scalability to large datasets.
We apply SPF to an Amazon customer feedback dataset, leveraging predefined product categories as guiding structures. Our evaluation shows that SPF outperforms alternative guided topic models in both predictive performance and computational efficiency. Furthermore, robustness checks highlight SPF's ability to adaptively balance domain knowledge and data-driven topic discovery, even when the seed words are imperfectly chosen. These results establish SPF as a powerful and scalable alternative for integrating expert knowledge into topic modeling, enhancing both interpretability and efficiency in real-world applications.
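The seeding idea in the abstract, raising the prior rate of seed words within their topic, can be illustrated with a small sketch. This is not the authors' implementation: the function `seeded_prior_shapes`, the hyperparameters `base_shape` and `seed_boost`, the toy vocabulary, and the choice to encode the boost in a Gamma shape parameter are all assumptions made for illustration. The abstract only states that seed words receive higher initial rates.

```python
import numpy as np

def seeded_prior_shapes(vocab, seed_words, base_shape=0.3, seed_boost=1.0):
    """Build a (topics x vocab) array of Gamma shape parameters.

    Every topic-term pair starts at `base_shape`; pairs where the term is a
    seed word of the topic get `seed_boost` added (hypothetical scheme).
    """
    index = {word: j for j, word in enumerate(vocab)}
    topics = list(seed_words)
    shapes = np.full((len(topics), len(vocab)), base_shape)
    for k, topic in enumerate(topics):
        for word in seed_words[topic]:
            if word in index:  # ignore seed words missing from the vocabulary
                shapes[k, index[word]] += seed_boost
    return topics, shapes

# Toy vocabulary and seed lists (made up for this example).
vocab = ["battery", "screen", "novel", "chapter", "price"]
seed_words = {
    "electronics": ["battery", "screen"],
    "books": ["novel", "chapter"],
}
topics, shapes = seeded_prior_shapes(vocab, seed_words)

# With a common Gamma rate, the prior mean intensity is shape / rate,
# so seed words have a higher expected intensity in their own topic.
rate = 1.0
prior_mean = shapes / rate

# Simulate from the generative side of a Poisson factorization:
# counts[d, v] ~ Poisson(sum_k theta[d, k] * beta[k, v]).
rng = np.random.default_rng(0)
beta = rng.gamma(shape=shapes, scale=1.0 / rate)               # topic-term intensities
theta = rng.gamma(shape=0.3, scale=1.0 / 0.3, size=(3, len(topics)))  # document-topic intensities
counts = rng.poisson(theta @ beta)                             # 3 documents x 5 terms
```

In this sketch the seeds act only through the prior, so a fitted model would remain free to move a term's intensity away from its seeded value if the data disagree.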
Submitted 4 March, 2025;
originally announced March 2025.
Revisiting Group Differences in High-Dimensional Choices: Method and Application to Congressional Speech
Authors:
Paul Hofmarcher,
Jan Vávra,
Sourav Adhikari,
Bettina Grün
Abstract:
Gentzkow, Shapiro and Taddy (Econometrica, Vol. 87, No. 4, 2019; henceforth GST) use a supervised text-based regression model to assess changes in partisanship in U.S. congressional speech over time. Their estimates imply that partisanship is far greater in recent years than in the past, and that it increased sharply in the early 1990s. This paper provides a replication in the wide sense of GST by complementing their analysis in three ways. First, we propose an alternative unsupervised language model, which combines ideas from topic models and ideal point models, to analyze the change in partisanship over time. We apply this model to the Senate speech data used in GST, which range from 1981 to 2017. Using our model, we replicate their results on the specific evolution of partisanship. Second, our model provides additional insights, such as data-driven estimates of how topical content evolves over time. Third, we identify key partisan phrases at the topic level.
Submitted 11 January, 2025; v1 submitted 22 June, 2022;
originally announced June 2022.