-
Using Zero-shot Prompting in the Automatic Creation and Expansion of Topic Taxonomies for Tagging Retail Banking Transactions
Authors:
Daniel de S. Moraes,
Pedro T. C. Santos,
Polyana B. da Costa,
Matheus A. S. Pinto,
Ivan de J. P. Pinto,
Álvaro M. G. da Veiga,
Sergio Colcher,
Antonio J. G. Busson,
Rafael H. Rocha,
Rennan Gaio,
Rafael Miceli,
Gabriela Tourinho,
Marcos Rabaioli,
Leandro Santos,
Fellipe Marques,
David Favaro
Abstract:
This work presents an unsupervised method for automatically constructing and expanding topic taxonomies using instruction-based fine-tuned LLMs (Large Language Models). We apply topic modeling and keyword extraction techniques to create initial topic taxonomies and LLMs to post-process the resulting terms and create a hierarchy. To expand an existing taxonomy with new terms, we use zero-shot prompting to find out where to add new nodes; to our knowledge, this is the first work to present such an approach to taxonomy tasks. We use the resulting taxonomies to assign tags that characterize merchants from a retail bank dataset. To evaluate our work, we asked 12 volunteers to answer a two-part form in which we first assessed the quality of the taxonomies created and then the tags assigned to merchants based on that taxonomy. The evaluation revealed a coherence rate exceeding 90% for the chosen taxonomies. The taxonomies' expansion with LLMs also showed promising results for parent node prediction, with an F1-score above 70% in our taxonomies.
Submitted 11 February, 2024; v1 submitted 7 January, 2024;
originally announced January 2024.
-
Generalized Information Criteria for Structured Sparse Models
Authors:
Eduardo F. Mendes,
Gabriel J. P. Pinto
Abstract:
Regularized m-estimators are widely used due to their ability to recover a low-dimensional model in high-dimensional scenarios. Some recent efforts on this subject have focused on creating a unified framework for establishing oracle bounds and deriving conditions for support recovery. Under this same framework, we propose a new Generalized Information Criteria (GIC) that takes into consideration the sparsity pattern one wishes to recover. We obtain non-asymptotic model selection bounds and sufficient conditions for model selection consistency of the GIC. Furthermore, we show that the GIC can also be used for selecting the regularization parameter within a regularized $m$-estimation framework, which allows practical use of the GIC for model selection in high-dimensional scenarios. We provide examples of group LASSO in the context of generalized linear regression and low rank matrix regression.
Submitted 4 September, 2023;
originally announced September 2023.
-
Parallel Clustering of Single Cell Transcriptomic Data with Split-Merge Sampling on Dirichlet Process Mixtures
Authors:
Tiehang Duan,
José P. Pinto,
Xiaohui Xie
Abstract:
Motivation: With the development of droplet-based systems, massive single-cell transcriptome data has become available, which enables analysis of cellular and molecular processes at single-cell resolution and is instrumental to understanding many biological processes. While state-of-the-art clustering methods have been applied to the data, they face challenges in the following aspects: (1) the clustering quality still needs to be improved; (2) most models need prior knowledge of the number of clusters, which is not always available; (3) there is a demand for faster computational speed. Results: We propose to tackle these challenges with Parallel Split Merge Sampling on Dirichlet Process Mixture Model (the Para-DPMM model). Unlike classic DPMM methods that perform sampling on each single data point, the split merge mechanism samples on the cluster level, which significantly improves convergence and optimality of the result. The model is highly parallelized and can utilize the computing power of high performance computing (HPC) clusters, enabling massive clustering on huge datasets. Experimental results show the model outperforms current widely used models in both clustering quality and computational speed. Availability: Source code is publicly available at https://github.com/tiehangd/Para_DPMM/tree/master/Para_DPMM_package
Submitted 25 December, 2018;
originally announced December 2018.
-
Sudden change of quantum discord for a system of two qubits
Authors:
João P. G. Pinto,
Goktug Karpat,
Felipe F. Fanchini
Abstract:
It is known that quantum discord might experience a sudden transition in its dynamics when calculated for certain Bell-diagonal states (BDS) that are in interaction with their surroundings. We examine this phenomenon, known as the sudden change of quantum discord, considering the case of two qubits independently interacting with dephasing reservoirs. We first numerically demonstrate that, for a class of initial states which can be chosen arbitrarily close to BDS, the transition is in fact not sudden, although it might numerically appear so if not studied carefully. Then, we provide an extension of this discussion covering the X-shaped density matrices. Our findings suggest that the transition of quantum discord might be sudden only for a highly idealized zero-measure subset of states within the set of all possible initial conditions of two qubits.
Submitted 30 September, 2013;
originally announced September 2013.
-
Relativistic deuteron structure function at large Q^2
Authors:
J. Paulo Pinto,
A. Amorim,
F. D. Santos
Abstract:
The deuteron deep inelastic unpolarized structure function F_2^D is calculated using the Wilson operator product expansion method. The long distance behaviour, related to the deuteron bound state properties, is evaluated using the Bethe-Salpeter equation with one particle on mass shell. The calculated ratio F_2^D/F_2^N is compared with other convolution models, showing important deviations in the region of large x. The implications for the evaluation of the neutron structure function from combined data on deuterons and protons are discussed.
Submitted 21 November, 1997;
originally announced November 1997.