-
Generalized Naive Bayes
Authors:
Edith Alice Kovács,
Anna Ország,
Dániel Pfeifer,
András Benczúr
Abstract:
In this paper we introduce the so-called Generalized Naive Bayes structure as an extension of the Naive Bayes structure. We give a new greedy algorithm that finds a good fitting Generalized Naive Bayes (GNB) probability distribution. We prove that this fits the data at least as well as the probability distribution determined by the classical Naive Bayes (NB). Then, under a not very restrictive con…
▽ More
In this paper we introduce the so-called Generalized Naive Bayes structure as an extension of the Naive Bayes structure. We give a new greedy algorithm that finds a good fitting Generalized Naive Bayes (GNB) probability distribution. We prove that this fits the data at least as well as the probability distribution determined by the classical Naive Bayes (NB). Then, under a not very restrictive condition, we give a second algorithm for which we can prove that it finds the optimal GNB probability distribution, i.e. best fitting structure in the sense of KL divergence. Both algorithms are constructed to maximize the information content and aim to minimize redundancy. Based on these algorithms, new methods for feature selection are introduced. We discuss the similarities and differences to other related algorithms in terms of structure, methodology, and complexity. Experimental results show, that the algorithms introduced outperform the related algorithms in many cases.
△ Less
Submitted 28 August, 2024;
originally announced August 2024.
-
Matrix and graph representations of vine copula structures
Authors:
Dániel Pfeifer,
Edith Alice Kovács
Abstract:
Vine copulas can efficiently model multivariate probability distributions. This paper focuses on a more thorough understanding of their structures, since in the literature, vine copula representations are often ambiguous. The graph representations include the original, cherry and chordal graph sequence structures, which we show equivalence between. Importantly we also show a new result, namely tha…
▽ More
Vine copulas can efficiently model multivariate probability distributions. This paper focuses on a more thorough understanding of their structures, since in the literature, vine copula representations are often ambiguous. The graph representations include the original, cherry and chordal graph sequence structures, which we show equivalence between. Importantly we also show a new result, namely that when a perfect elimination ordering of a vine structure is given, then it can always be uniquely represented with a matrix. O. M. Nápoles has shown a way to represent vines in a matrix, and we algorithmify this previous approach, while also showing a new method for constructing such a matrix, through cherry tree sequences. We also calculate the runtime of these algorithms. Lastly, we prove that these two matrix-building algorithms are equivalent if the same perfect elimination ordering is being used.
△ Less
Submitted 10 March, 2023; v1 submitted 10 May, 2022;
originally announced May 2022.
-
Topic Grouper: An Agglomerative Clustering Approach to Topic Modeling
Authors:
Daniel Pfeifer,
Jochen L. Leidner
Abstract:
We introduce Topic Grouper as a complementary approach in the field of probabilistic topic modeling. Topic Grouper creates a disjunctive partitioning of the training vocabulary in a stepwise manner such that resulting partitions represent topics. It is governed by a simple generative model, where the likelihood to generate the training documents via topics is optimized. The algorithm starts with o…
▽ More
We introduce Topic Grouper as a complementary approach in the field of probabilistic topic modeling. Topic Grouper creates a disjunctive partitioning of the training vocabulary in a stepwise manner such that resulting partitions represent topics. It is governed by a simple generative model, where the likelihood to generate the training documents via topics is optimized. The algorithm starts with one-word topics and joins two topics at every step. It therefore generates a solution for every desired number of topics ranging between the size of the training vocabulary and one. The process represents an agglomerative clustering that corresponds to a binary tree of topics. A resulting tree may act as a containment hierarchy, typically with more general topics towards the root of tree and more specific topics towards the leaves. Topic Grouper is not governed by a background distribution such as the Dirichlet and avoids hyper parameter optimizations.
We show that Topic Grouper has reasonable predictive power and also a reasonable theoretical and practical complexity. Topic Grouper can deal well with stop words and function words and tends to push them into their own topics. Also, it can handle topic distributions, where some topics are more frequent than others. We present typical examples of computed topics from evaluation datasets, where topics appear conclusive and coherent. In this context, the fact that each word belongs to exactly one topic is not a major limitation; in some scenarios this can even be a genuine advantage, e.g.~a related shopping basket analysis may aid in optimizing groupings of articles in sales catalogs.
△ Less
Submitted 13 April, 2019;
originally announced April 2019.
-
Theory and Practice of Transactional Method Caching
Authors:
Daniel Pfeifer,
Peter C. Lockemann
Abstract:
Nowadays, tiered architectures are widely accepted for constructing large scale information systems. In this context application servers often form the bottleneck for a system's efficiency. An application server exposes an object oriented interface consisting of set of methods which are accessed by potentially remote clients. The idea of method caching is to store results of read-only method inv…
▽ More
Nowadays, tiered architectures are widely accepted for constructing large scale information systems. In this context application servers often form the bottleneck for a system's efficiency. An application server exposes an object oriented interface consisting of set of methods which are accessed by potentially remote clients. The idea of method caching is to store results of read-only method invocations with respect to the application server's interface on the client side. If the client invokes the same method with the same arguments again, the corresponding result can be taken from the cache without contacting the server. It has been shown that this approach can considerably improve a real world system's efficiency.
This paper extends the concept of method caching by addressing the case where clients wrap related method invocations in ACID transactions. Demarcating sequences of method calls in this way is supported by many important application server standards. In this context the paper presents an architecture, a theory and an efficient protocol for maintaining full transactional consistency and in particular serializability when using a method cache on the client side. In order to create a protocol for scheduling cached method results, the paper extends a classical transaction formalism. Based on this extension, a recovery protocol and an optimistic serializability protocol are derived. The latter one differs from traditional transactional cache protocols in many essential ways. An efficiency experiment validates the approach: Using the cache a system's performance and scalability are considerably improved.
△ Less
Submitted 11 March, 2005; v1 submitted 9 March, 2005;
originally announced March 2005.