-
SLOT: Structuring the Output of Large Language Models
Authors:
Darren Yow-Bang Wang,
Zhengyuan Shen,
Soumya Smruti Mishra,
Zhichao Xu,
Yifei Teng,
Haibo Ding
Abstract:
Structured outputs are essential for large language models (LLMs) in critical applications like agents and information extraction. Despite their capabilities, LLMs often generate outputs that deviate from predefined schemas, significantly hampering reliable application development. We present SLOT (Structured LLM Output Transformer), a model-agnostic approach that transforms unstructured LLM outpu…
▽ More
Structured outputs are essential for large language models (LLMs) in critical applications like agents and information extraction. Despite their capabilities, LLMs often generate outputs that deviate from predefined schemas, significantly hampering reliable application development. We present SLOT (Structured LLM Output Transformer), a model-agnostic approach that transforms unstructured LLM outputs into precise structured formats. While existing solutions predominantly rely on constrained decoding techniques or are tightly coupled with specific models, SLOT employs a fine-tuned lightweight language model as a post-processing layer, achieving flexibility across various LLMs and schema specifications. We introduce a systematic pipeline for data curation and synthesis alongside a formal evaluation methodology that quantifies both schema accuracy and content fidelity. Our results demonstrate that fine-tuned Mistral-7B model with constrained decoding achieves near perfect schema accuracy (99.5%) and content similarity (94.0%), outperforming Claude-3.5-Sonnet by substantial margins (+25 and +20 percentage points, respectively). Notably, even compact models like Llama-3.2-1B can match or exceed the structured output capabilities of much larger proprietary models when equipped with SLOT, enabling reliable structured generation in resource-constrained environments.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Explainability of Highly Associated Fuzzy Churn Patterns in Binary Classification
Authors:
D. Y. C. Wang,
Lars Arne Jordanger,
Jerry Chun-Wei Lin
Abstract:
Customer churn, particularly in the telecommunications sector, influences both costs and profits. As the explainability of models becomes increasingly important, this study emphasizes not only the explainability of customer churn through machine learning models, but also the importance of identifying multivariate patterns and setting soft bounds for intuitive interpretation. The main objective is…
▽ More
Customer churn, particularly in the telecommunications sector, influences both costs and profits. As the explainability of models becomes increasingly important, this study emphasizes not only the explainability of customer churn through machine learning models, but also the importance of identifying multivariate patterns and setting soft bounds for intuitive interpretation. The main objective is to use a machine learning model and fuzzy-set theory with top-\textit{k} HUIM to identify highly associated patterns of customer churn with intuitive identification, referred to as Highly Associated Fuzzy Churn Patterns (HAFCP). Moreover, this method aids in uncovering association rules among multiple features across low, medium, and high distributions. Such discoveries are instrumental in enhancing the explainability of findings. Experiments show that when the top-5 HAFCPs are included in five datasets, a mixture of performance results is observed, with some showing notable improvements. It becomes clear that high importance features enhance explanatory power through their distribution and patterns associated with other features. As a result, the study introduces an innovative approach that improves the explainability and effectiveness of customer churn prediction models.
△ Less
Submitted 21 October, 2024;
originally announced October 2024.
-
A Utility-Mining-Driven Active Learning Approach for Analyzing Clickstream Sequences
Authors:
Danny Y. C. Wang,
Lars Arne Jordanger,
Jerry Chun-Wei Lin
Abstract:
In rapidly evolving e-commerce industry, the capability of selecting high-quality data for model training is essential. This study introduces the High-Utility Sequential Pattern Mining using SHAP values (HUSPM-SHAP) model, a utility mining-based active learning strategy to tackle this challenge. We found that the parameter settings for positive and negative SHAP values impact the model's mining ou…
▽ More
In rapidly evolving e-commerce industry, the capability of selecting high-quality data for model training is essential. This study introduces the High-Utility Sequential Pattern Mining using SHAP values (HUSPM-SHAP) model, a utility mining-based active learning strategy to tackle this challenge. We found that the parameter settings for positive and negative SHAP values impact the model's mining outcomes, introducing a key consideration into the active learning framework. Through extensive experiments aimed at predicting behaviors that do lead to purchases or not, the designed HUSPM-SHAP model demonstrates its superiority across diverse scenarios. The model's ability to mitigate labeling needs while maintaining high predictive performance is highlighted. Our findings demonstrate the model's capability to refine e-commerce data processing, steering towards more streamlined, cost-effective prediction modeling.
△ Less
Submitted 9 October, 2024;
originally announced October 2024.
-
DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining
Authors:
Vinayak Arannil,
Neha Narwal,
Sourav Sanjukta Bhabesh,
Sai Nikhil Thirandas,
Darren Yow-Bang Wang,
Graham Horwood,
Alex Anto Chirayath,
Gouri Pandeshwar
Abstract:
Large Language Models (LLMs) have shown remarkable ability to generalize effectively across numerous industry domains while executing a range of tasks. Many of these competencies are obtained from the data utilized during the pre-training phase of the Language Models (LMs). However, these models exhibit limitations when tasked with performing in specialized or low-resource industry domains. More r…
▽ More
Large Language Models (LLMs) have shown remarkable ability to generalize effectively across numerous industry domains while executing a range of tasks. Many of these competencies are obtained from the data utilized during the pre-training phase of the Language Models (LMs). However, these models exhibit limitations when tasked with performing in specialized or low-resource industry domains. More recent approaches use LLMs for generating domain-specific synthetic data but most often they lack in truthfulness and complexity. Alternatively, in cases where domain data is available like healthcare and finance most of the LMs are proprietary necessitating the need for a scalable method to curate real world industry specific pre-training data. In this work, we propose an automated and scalable framework - DoPAMine:Domain-specific Pre-training Adaptation from seed-guided data Mining, to mine domain specific training data from a large data corpus for domain adaptation of a LM. The framework leverages the parametric knowledge of a LLM to generate diverse and representative seed data tailored to a specific domain which is then used to mine real world data from a large data corpus like Common Crawl. We evaluated our framework's performance in the continual pre-training (CPT) setting by training two domain specific 7B parameter LMs in healthcare and finance with data mined via DoPAMine. Our experiments show that DoPAMine boosts the performance of pre-trained LLMs on average by 4.9% and 5.1% in zero-shot and 5-shot settings respectively on healthcare tasks from MMLU, MedQA, MedMCQA and PubMedQA datasets, and 2.9% and 6.7% for zero-shot and 5-shot settings respectively on finance tasks from FiQA-SA, FPB and Headlines datasets when compared to the baseline.
△ Less
Submitted 9 October, 2024; v1 submitted 30 September, 2024;
originally announced October 2024.
-
Security Analysis of a Password-Based Authentication Protocol Proposed to IEEE 1363
Authors:
Z. Zhao,
Z. Dongand Yongge Wang
Abstract:
In recent years, several protocols for password-based authenticated key exchange have been proposed. These protocols aim to be secure even though the sample space of passwords may be small enough to be enumerated by an off-line adversary. In Eurocrypt 2000, Bellare, Pointcheval and Rogaway (BPR) presented a model and security definition for authenticated key exchange. They claimed that in the idea…
▽ More
In recent years, several protocols for password-based authenticated key exchange have been proposed. These protocols aim to be secure even though the sample space of passwords may be small enough to be enumerated by an off-line adversary. In Eurocrypt 2000, Bellare, Pointcheval and Rogaway (BPR) presented a model and security definition for authenticated key exchange. They claimed that in the ideal-cipher model (random oracles), the two-flow protocol at the core of Encrypted Key Exchange (EKE) is secure. Bellare and Rogaway suggested several instantiations of the ideal cipher in their proposal to the IEEE P1363.2 working group. Since then there has been an increased interest in proving the security of password-based protocols in the ideal-cipher model. For example, Bresson, Chevassut, and Pointcheval have recently showed that the One-Encryption-Key-Exchange (OEKE) protocol is secure in the ideal cipher model. In this paper, we present examples of real (NOT ideal) ciphers (including naive implementations of the instantiations proposed to IEEE P1363.2) that would result in broken instantiations of the idealised AuthA protocol and OEKE protocol. Our result shows that the AuthA protocol can be instantiated in an insecure way, and that there are no well defined (let alone rigorous) ways to distinguish between secure and insecure instantiations. Thus, without a rigorous metric for ideal-ciphers, the value of provable security in ideal cipher model is limited.
△ Less
Submitted 23 July, 2012;
originally announced July 2012.