Search | arXiv e-print repository

AI Content Self-Detection for Transformer-based Large Language Models

Authors: Antônio Junior Alves Caiado, Michael Hahsler

Abstract: $ $The usage of generative artificial intelligence (AI) tools based on large language models, including ChatGPT, Bard, and Claude, for text generation has many exciting applications with the potential for phenomenal productivity gains. One issue is authorship attribution when using AI tools. This is especially important in an academic setting where the inappropriate use of generative AI tools may… ▽ More $ $The usage of generative artificial intelligence (AI) tools based on large language models, including ChatGPT, Bard, and Claude, for text generation has many exciting applications with the potential for phenomenal productivity gains. One issue is authorship attribution when using AI tools. This is especially important in an academic setting where the inappropriate use of generative AI tools may hinder student learning or stifle research by creating a large amount of automatically generated derivative work. Existing plagiarism detection systems can trace the source of submitted text but are not yet equipped with methods to accurately detect AI-generated text. This paper introduces the idea of direct origin detection and evaluates whether generative AI systems can recognize their output and distinguish it from human-written texts. We argue why current transformer-based models may be able to self-detect their own generated text and perform a small empirical study using zero-shot learning to investigate if that is the case. Results reveal varying capabilities of AI systems to identify their generated text. Google's Bard model exhibits the largest capability of self-detection with an accuracy of 94\%, followed by OpenAI's ChatGPT with 83\%. On the other hand, Anthropic's Claude model seems to be not able to self-detect. △ Less

Submitted 28 December, 2023; originally announced December 2023.

arXiv:2305.15263 [pdf, other]

ARULESPY: Exploring Association Rules and Frequent Itemsets in Python

Authors: Michael Hahsler

Abstract: The R arules package implements a comprehensive infrastructure for representing, manipulating, and analyzing transaction data and patterns using frequent itemsets and association rules. The package also provides a wide range of interest measures and mining algorithms, including the code of Christian Borgelt's popular and efficient C implementations of the association mining algorithms Apriori and… ▽ More The R arules package implements a comprehensive infrastructure for representing, manipulating, and analyzing transaction data and patterns using frequent itemsets and association rules. The package also provides a wide range of interest measures and mining algorithms, including the code of Christian Borgelt's popular and efficient C implementations of the association mining algorithms Apriori and Eclat, and optimized C/C++ code for mining and manipulating association rules using sparse matrix representation. This document describes the new Python package arulespy, which makes this infrastructure available for Python users. △ Less

Submitted 24 May, 2023; originally announced May 2023.

arXiv:2205.12371 [pdf, other]

recommenderlab: An R Framework for Developing and Testing Recommendation Algorithms

Authors: Michael Hahsler

Abstract: Algorithms that create recommendations based on observed data have significant commercial value for online retailers and many other industries. Recommender systems have a significant research community, and studying such systems is part of most modern data science curricula. While there is an abundance of software that implements recommendation algorithms, there is little in terms of supporting re… ▽ More Algorithms that create recommendations based on observed data have significant commercial value for online retailers and many other industries. Recommender systems have a significant research community, and studying such systems is part of most modern data science curricula. While there is an abundance of software that implements recommendation algorithms, there is little in terms of supporting recommender system research and education. This paper describes the open-source software recommenderlab which was created with supporting research and education in mind. The package can be directly installed in R or downloaded from https://github.com/mhahsler/recommenderlab. △ Less

Submitted 24 May, 2022; originally announced May 2022.

arXiv:1909.04217 [pdf, other]

Swapped Face Detection using Deep Learning and Subjective Assessment

Authors: Xinyi Ding, Zohreh Raziei, Eric C. Larson, Eli V. Olinick, Paul Krueger, Michael Hahsler

Abstract: The tremendous success of deep learning for imaging applications has resulted in numerous beneficial advances. Unfortunately, this success has also been a catalyst for malicious uses such as photo-realistic face swapping of parties without consent. Transferring one person's face from a source image to a target image of another person, while keeping the image photo-realistic overall has become incr… ▽ More The tremendous success of deep learning for imaging applications has resulted in numerous beneficial advances. Unfortunately, this success has also been a catalyst for malicious uses such as photo-realistic face swapping of parties without consent. Transferring one person's face from a source image to a target image of another person, while keeping the image photo-realistic overall has become increasingly easy and automatic, even for individuals without much knowledge of image processing. In this study, we use deep transfer learning for face swapping detection, showing true positive rates >96% with very few false alarms. Distinguished from existing methods that only provide detection accuracy, we also provide uncertainty for each prediction, which is critical for trust in the deployment of such detection systems. Moreover, we provide a comparison to human subjects. To capture human recognition performance, we build a website to collect pairwise comparisons of images from human subjects. Based on these comparisons, images are ranked from most real to most fake. We compare this ranking to the outputs from our automatic model, showing good, but imperfect, correspondence with linear correlations >0.75. Overall, the results show the effectiveness of our method. As part of this study, we create a novel, publicly available dataset that is, to the best of our knowledge, the largest public swapped face dataset created using still images. Our goal of this study is to inspire more research in the field of image forensics through the creation of a public dataset and initial analysis. △ Less

Submitted 9 September, 2019; originally announced September 2019.

Comments: 8 pages, 5 figures

arXiv:0803.3224 [pdf, other]

doi 10.1007/s10618-005-0026-2

A Model-Based Frequency Constraint for Mining Associations from Transaction Data

Authors: Michael Hahsler

Abstract: Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of the associations's significance. A single user-specified support threshold is used to decided if associations should be further investigated. Support has some known problems with r… ▽ More Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of the associations's significance. A single user-specified support threshold is used to decided if associations should be further investigated. Support has some known problems with rare items, favors shorter itemsets and sometimes produces misleading associations. In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model) which allows for transaction data's typically highly skewed item frequency distribution. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier to set and interpret by the user. △ Less

Submitted 21 March, 2008; originally announced March 2008.

ACM Class: H.2.8

Journal ref: Michael Hahsler. A model-based frequency constraint for mining associations from transaction data. Data Mining and Knowledge Discovery, 13(2):137-166, September 2006

arXiv:0803.1104 [pdf]

Optimizing Web Sites for Customer Retention

Authors: Michael Hahsler

Abstract: With customer relationship management (CRM) companies move away from a mainly product-centered view to a customer-centered view. Resulting from this change, the effective management of how to keep contact with customers throughout different channels is one of the key success factors in today's business world. Company Web sites have evolved in many industries into an extremely important channel t… ▽ More With customer relationship management (CRM) companies move away from a mainly product-centered view to a customer-centered view. Resulting from this change, the effective management of how to keep contact with customers throughout different channels is one of the key success factors in today's business world. Company Web sites have evolved in many industries into an extremely important channel through which customers can be attracted and retained. To analyze and optimize this channel, accurate models of how customers browse through the Web site and what information within the site they repeatedly view are crucial. Typically, data mining techniques are used for this purpose. However, there already exist numerous models developed in marketing research for traditional channels which could also prove valuable to understanding this new channel. In this paper we propose the application of an extension of the Logarithmic Series Distribution (LSD) model repeat-usage of Web-based information and thus to analyze and optimize a Web Site's capability to support one goal of CRM, to retain customers. As an example, we use the university's blended learning web portal with over a thousand learning resources to demonstrate how the model can be used to evaluate and improve the Web site's effectiveness. △ Less

Submitted 7 March, 2008; originally announced March 2008.

arXiv:0803.0966 [pdf, other]

doi 10.3233/IDA-2007-11502

New probabilistic interest measures for association rules

Authors: Michael Hahsler, Kurt Hornik

Abstract: Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. In this paper, we start with presenting a simple probabilistic framework for transaction data which ca… ▽ More Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. In this paper, we start with presenting a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world database from a grocery outlet to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left hand side of rules and that lift performs poorly to filter random noise in transaction data. Based on the probabilistic framework we develop two new interest measures, hyper-lift and hyper-confidence, which can be used to filter or order mined association rules. The new measures show significantly better performance than lift for applications where spurious rules are problematic. △ Less

Submitted 6 March, 2008; originally announced March 2008.

Journal ref: Intelligent Data Analysis, 11(5):437-455, 2007

arXiv:0803.0954 [pdf, other]

doi 10.1007/s00180-007-0062-z

Selective association rule generation

Authors: Michael Hahsler, Christian Buchta, Kurt Hornik

Abstract: Mining association rules is a popular and well researched method for discovering interesting relations between variables in large databases. A practical problem is that at medium to low support values often a large number of frequent itemsets and an even larger number of association rules are found in a database. A widely used approach is to gradually increase minimum support and minimum confide… ▽ More Mining association rules is a popular and well researched method for discovering interesting relations between variables in large databases. A practical problem is that at medium to low support values often a large number of frequent itemsets and an even larger number of association rules are found in a database. A widely used approach is to gradually increase minimum support and minimum confidence or to filter the found rules using increasingly strict constraints on additional measures of interestingness until the set of rules found is reduced to a manageable size. In this paper we describe a different approach which is based on the idea to first define a set of ``interesting'' itemsets (e.g., by a mixture of mining and expert knowledge) and then, in a second step to selectively generate rules for only these itemsets. The main advantage of this approach over increasing thresholds or filtering rules is that the number of rules found is significantly reduced while at the same time it is not necessary to increase the support and confidence thresholds which might lead to missing important information in the database. △ Less

Submitted 6 March, 2008; originally announced March 2008.

Journal ref: Computational Statistics, 2007. Online First, Published: 25 July 2007

Showing 1–8 of 8 results for author: Hahsler, M