Search | arXiv e-print repository

SLOT: Structuring the Output of Large Language Models

Authors: Darren Yow-Bang Wang, Zhengyuan Shen, Soumya Smruti Mishra, Zhichao Xu, Yifei Teng, Haibo Ding

Abstract: Structured outputs are essential for large language models (LLMs) in critical applications like agents and information extraction. Despite their capabilities, LLMs often generate outputs that deviate from predefined schemas, significantly hampering reliable application development. We present SLOT (Structured LLM Output Transformer), a model-agnostic approach that transforms unstructured LLM outputs into precise structured formats. While existing solutions predominantly rely on constrained decoding techniques or are tightly coupled with specific models, SLOT employs a fine-tuned lightweight language model as a post-processing layer, achieving flexibility across various LLMs and schema specifications. We introduce a systematic pipeline for data curation and synthesis alongside a formal evaluation methodology that quantifies both schema accuracy and content fidelity. Our results demonstrate that fine-tuned Mistral-7B model with constrained decoding achieves near perfect schema accuracy (99.5%) and content similarity (94.0%), outperforming Claude-3.5-Sonnet by substantial margins (+25 and +20 percentage points, respectively). Notably, even compact models like Llama-3.2-1B can match or exceed the structured output capabilities of much larger proprietary models when equipped with SLOT, enabling reliable structured generation in resource-constrained environments.

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2410.15827 [pdf, other]

Explainability of Highly Associated Fuzzy Churn Patterns in Binary Classification

Authors: D. Y. C. Wang, Lars Arne Jordanger, Jerry Chun-Wei Lin

Abstract: Customer churn, particularly in the telecommunications sector, influences both costs and profits. As the explainability of models becomes increasingly important, this study emphasizes not only the explainability of customer churn through machine learning models, but also the importance of identifying multivariate patterns and setting soft bounds for intuitive interpretation. The main objective is to use a machine learning model and fuzzy-set theory with top-k HUIM to identify highly associated patterns of customer churn with intuitive identification, referred to as Highly Associated Fuzzy Churn Patterns (HAFCP). Moreover, this method aids in uncovering association rules among multiple features across low, medium, and high distributions. Such discoveries are instrumental in enhancing the explainability of findings. Experiments show that when the top-5 HAFCPs are included in five datasets, a mixture of performance results is observed, with some showing notable improvements. It becomes clear that high importance features enhance explanatory power through their distribution and patterns associated with other features. As a result, the study introduces an innovative approach that improves the explainability and effectiveness of customer churn prediction models.

Submitted 21 October, 2024; originally announced October 2024.

Comments: 18 pages, single column, 4 figures. This paper is an extended version of a work originally presented at the 6th International Workshop on Utility-Driven Mining and Learning (held in conjunction with the 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024) on May 7, 2024

arXiv:2410.07282 [pdf, other]

A Utility-Mining-Driven Active Learning Approach for Analyzing Clickstream Sequences

Authors: Danny Y. C. Wang, Lars Arne Jordanger, Jerry Chun-Wei Lin

Abstract: In the rapidly evolving e-commerce industry, the capability of selecting high-quality data for model training is essential. This study introduces the High-Utility Sequential Pattern Mining using SHAP values (HUSPM-SHAP) model, a utility mining-based active learning strategy to tackle this challenge. We found that the parameter settings for positive and negative SHAP values impact the model's mining outcomes, introducing a key consideration into the active learning framework. Through extensive experiments aimed at predicting whether behaviors lead to purchases or not, the designed HUSPM-SHAP model demonstrates its superiority across diverse scenarios. The model's ability to mitigate labeling needs while maintaining high predictive performance is highlighted. Our findings demonstrate the model's capability to refine e-commerce data processing, steering towards more streamlined, cost-effective prediction modeling.

Submitted 9 October, 2024; originally announced October 2024.

Comments: 7 pages, 2 figures, preprint version

arXiv:2410.00260 [pdf, other]

DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining

Authors: Vinayak Arannil, Neha Narwal, Sourav Sanjukta Bhabesh, Sai Nikhil Thirandas, Darren Yow-Bang Wang, Graham Horwood, Alex Anto Chirayath, Gouri Pandeshwar

Abstract: Large Language Models (LLMs) have shown remarkable ability to generalize effectively across numerous industry domains while executing a range of tasks. Many of these competencies are obtained from the data utilized during the pre-training phase of the Language Models (LMs). However, these models exhibit limitations when tasked with performing in specialized or low-resource industry domains. More recent approaches use LLMs for generating domain-specific synthetic data but most often they lack in truthfulness and complexity. Alternatively, in cases where domain data is available like healthcare and finance most of the LMs are proprietary necessitating the need for a scalable method to curate real world industry specific pre-training data. In this work, we propose an automated and scalable framework - DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining, to mine domain specific training data from a large data corpus for domain adaptation of a LM. The framework leverages the parametric knowledge of a LLM to generate diverse and representative seed data tailored to a specific domain which is then used to mine real world data from a large data corpus like Common Crawl. We evaluated our framework's performance in the continual pre-training (CPT) setting by training two domain specific 7B parameter LMs in healthcare and finance with data mined via DoPAMine. Our experiments show that DoPAMine boosts the performance of pre-trained LLMs on average by 4.9% and 5.1% in zero-shot and 5-shot settings respectively on healthcare tasks from MMLU, MedQA, MedMCQA and PubMedQA datasets, and 2.9% and 6.7% for zero-shot and 5-shot settings respectively on finance tasks from FiQA-SA, FPB and Headlines datasets when compared to the baseline.

Submitted 9 October, 2024; v1 submitted 30 September, 2024; originally announced October 2024.

arXiv:1207.5442 [pdf, ps, other]

Security Analysis of a Password-Based Authentication Protocol Proposed to IEEE 1363

Authors: Z. Zhao, Z. Dong and Yongge Wang

Abstract: In recent years, several protocols for password-based authenticated key exchange have been proposed. These protocols aim to be secure even though the sample space of passwords may be small enough to be enumerated by an off-line adversary. In Eurocrypt 2000, Bellare, Pointcheval and Rogaway (BPR) presented a model and security definition for authenticated key exchange. They claimed that in the ideal-cipher model (random oracles), the two-flow protocol at the core of Encrypted Key Exchange (EKE) is secure. Bellare and Rogaway suggested several instantiations of the ideal cipher in their proposal to the IEEE P1363.2 working group. Since then there has been an increased interest in proving the security of password-based protocols in the ideal-cipher model. For example, Bresson, Chevassut, and Pointcheval have recently showed that the One-Encryption-Key-Exchange (OEKE) protocol is secure in the ideal cipher model. In this paper, we present examples of real (NOT ideal) ciphers (including naive implementations of the instantiations proposed to IEEE P1363.2) that would result in broken instantiations of the idealised AuthA protocol and OEKE protocol. Our result shows that the AuthA protocol can be instantiated in an insecure way, and that there are no well defined (let alone rigorous) ways to distinguish between secure and insecure instantiations. Thus, without a rigorous metric for ideal-ciphers, the value of provable security in ideal cipher model is limited.

Submitted 23 July, 2012; originally announced July 2012.

Journal ref: Theoretical Computer Science, 352(1-3):280--287, 2006

Showing 1–5 of 5 results for author: Wang, D Y