Search | arXiv e-print repository

Exploring the Impact of Temperature on Large Language Models: Hot or Cold?

Authors: Lujun Li, Lama Sleem, Niccolo' Gentile, Geoffrey Nichil, Radu State

Abstract: The sampling temperature, a critical hyperparameter in large language models (LLMs), modifies the logits before the softmax layer, thereby reshaping the distribution of output tokens. Recent studies have challenged the Stochastic Parrots analogy by demonstrating that LLMs are capable of understanding semantics rather than merely memorizing data and that randomness, modulated by sampling temperature, plays a crucial role in model inference. In this study, we systematically evaluated the impact of temperature in the range of 0 to 2 on data sets designed to assess six different capabilities, conducting statistical analyses on open source models of three different sizes: small (1B--4B), medium (6B--13B), and large (40B--80B). Our findings reveal distinct skill-specific effects of temperature on model performance, highlighting the complexity of optimal temperature selection in practical applications. To address this challenge, we propose a BERT-based temperature selector that takes advantage of these observed effects to identify the optimal temperature for a given prompt. We demonstrate that this approach can significantly improve the performance of small and medium models in the SuperGLUE datasets. Furthermore, our study extends to FP16 precision inference, revealing that temperature effects are consistent with those observed in 4-bit quantized models. By evaluating temperature effects up to 4.0 in three quantized models, we find that the Mutation Temperature -- the point at which significant performance changes occur -- increases with model size.

Submitted 8 June, 2025; originally announced June 2025.

arXiv:2505.16078 [pdf, ps, other]

Small Language Models in the Real World: Insights from Industrial Text Classification

Authors: Lujun Li, Lama Sleem, Niccolo' Gentile, Geoffrey Nichil, Radu State

Abstract: With the emergence of ChatGPT, Transformer models have significantly advanced text classification and related tasks. Decoder-only models such as Llama exhibit strong performance and flexibility, yet they suffer from inefficiency on inference due to token-by-token generation, and their effectiveness in text classification tasks heavily depends on prompt quality. Moreover, their substantial GPU resource requirements often limit widespread adoption. Thus, the question of whether smaller language models are capable of effectively handling text classification tasks emerges as a topic of significant interest. However, the selection of appropriate models and methodologies remains largely underexplored. In this paper, we conduct a comprehensive evaluation of prompt engineering and supervised fine-tuning methods for transformer-based text classification. Specifically, we focus on practical industrial scenarios, including email classification, legal document categorization, and the classification of extremely long academic texts. We examine the strengths and limitations of smaller models, with particular attention to both their performance and their efficiency in Video Random-Access Memory (VRAM) utilization, thereby providing valuable insights for the local deployment and application of compact models in industrial settings.

Submitted 23 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

arXiv:2503.24102 [pdf, ps, other]

Is LLM the Silver Bullet to Low-Resource Languages Machine Translation?

Authors: Yewei Song, Lujun Li, Cedric Lothritz, Saad Ezzini, Lama Sleem, Niccolo Gentile, Radu State, Tegawendé F. Bissyandé, Jacques Klein

Abstract: Low-Resource Languages (LRLs) present significant challenges in natural language processing due to their limited linguistic resources and underrepresentation in standard datasets. While recent advances in Large Language Models (LLMs) and Neural Machine Translation have substantially improved translation capabilities for high-resource languages, performance disparities persist for LRLs, particularly impacting privacy-sensitive and resource-constrained scenarios. This paper systematically evaluates current LLMs in 200 languages using the FLORES-200 benchmark and demonstrates their limitations in LRL translation capability. We also explore alternative data sources, including news articles and bilingual dictionaries, and demonstrate how knowledge distillation from large pre-trained teacher models can significantly improve the performance of small LLMs on LRL translation tasks. For example, this approach increases EN->LB with the LLM-as-a-Judge score on the validation set from 0.36 to 0.89 for Llama-3.2-3B. Furthermore, we examine different fine-tuning configurations, providing practical insights on optimal data scale, training efficiency, and the preservation of generalization capabilities of models under study.

Submitted 5 June, 2025; v1 submitted 31 March, 2025; originally announced March 2025.

arXiv:2411.17863 [pdf, other]

doi 10.1109/BigData62323.2024.10824946

LongKey: Keyphrase Extraction for Long Documents

Authors: Jeovane Honorio Alves, Radu State, Cinthia Obladen de Almendra Freitas, Jean Paul Barddal

Abstract: In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey's versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.

Submitted 26 November, 2024; originally announced November 2024.

Comments: Accepted for presentation at the 2024 IEEE International Conference on Big Data (IEEE BigData 2024). Code available at https://github.com/jeohalves/longkey

arXiv:2405.08044 [pdf, other]

On the Volatility of Shapley-Based Contribution Metrics in Federated Learning

Authors: Arno Geimer, Beltran Fiz, Radu State

Abstract: Federated learning (FL) is a collaborative and privacy-preserving Machine Learning paradigm, allowing the development of robust models without the need to centralize sensitive data. A critical challenge in FL lies in fairly and accurately allocating contributions from diverse participants. Inaccurate allocation can undermine trust, lead to unfair compensation, and thus participants may lack the incentive to join or actively contribute to the federation. Various remuneration strategies have been proposed to date, including auction-based approaches and Shapley-value-based methods, the latter offering a means to quantify the contribution of each participant. However, little to no work has studied the stability of these contribution evaluation methods. In this paper, we evaluate participant contributions in federated learning using gradient-based model reconstruction techniques with Shapley values and compare the round-based contributions to a classic data contribution measurement scheme. We provide an extensive analysis of the discrepancies of Shapley values across a set of aggregation strategies and examine them on an overall and a per-client level. We show that, between different aggregation techniques, Shapley values lead to unstable reward allocations among participants. Our analysis spans various data heterogeneity distributions, including independent and identically distributed (IID) and non-IID scenarios.

Submitted 26 May, 2025; v1 submitted 13 May, 2024; originally announced May 2024.

Comments: Accepted for publication at IJCNN 2025

arXiv:2401.07398 [pdf, other]

doi 10.1109/ACCESS.2024.3436620

Cross Domain Early Crop Mapping using CropSTGAN

Authors: Yiqun Wang, Hui Huang, Radu State

Abstract: Driven by abundant satellite imagery, machine learning-based approaches have recently been promoted to generate high-resolution crop cultivation maps to support many agricultural applications. One of the major challenges faced by these approaches is the limited availability of ground truth labels. In the absence of ground truth, existing work usually adopts the "direct transfer strategy" that trains a classifier using historical labels collected from other regions and then applies the trained model to the target region. Unfortunately, the spectral features of crops exhibit inter-region and inter-annual variability due to changes in soil composition, climate conditions, and crop progress, the resultant models perform poorly on new and unseen regions or years. Despite recent efforts, such as the application of the deep adaptation neural network (DANN) model structure in the deep adaptation crop classification network (DACCN), to tackle the above cross-domain challenges, their effectiveness diminishes significantly when there is a large dissimilarity between the source and target regions. This paper introduces the Crop Mapping Spectral-temporal Generative Adversarial Neural Network (CropSTGAN), a novel solution for cross-domain challenges, that doesn't require target domain labels. CropSTGAN learns to transform the target domain's spectral features to those of the source domain, effectively bridging large dissimilarities. Additionally, it employs an identity loss to maintain the intrinsic local structure of the data.
Comprehensive experiments across various regions and years demonstrate the benefits and effectiveness of the proposed approach. In experiments, CropSTGAN is benchmarked against various state-of-the-art (SOTA) methods. Notably, CropSTGAN significantly outperforms these methods in scenarios with large data distribution dissimilarities between the target and source domains.

Submitted 18 April, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

arXiv:2301.10209 [pdf, other]

XRP-NDN Overlay: Improving the Communication Efficiency of Consensus-Validation based Blockchains with an NDN Overlay

Authors: Lucian Trestioreanu, Wazen M. Shbair, Flaviene Scheidt de Cristo, Radu State

Abstract: With the growing adoption of Distributed Ledger Technologies and the subsequent scaling of these networks, there is an inherent need for efficient and resilient communication used by the underlying consensus and replication mechanisms. While resilient and efficient communication is one of the main pillars of an efficient blockchain network as a whole, the Distributed Ledger Technology is still relatively new and the task of scaling these networks has come with its own challenges towards ensuring these goals. New content distribution concepts like Information Centric Networking, of which Named Data Networking is a worthy example, create new possibilities towards achieving this goal, through in-network caching or built-in native multicasting, for example. We present and evaluate XRP-NDN Overlay, a solution for increasing the communication efficiency for consensus-validation based blockchains like the XRP Ledger. We experiment by sending the XRP Ledger consensus messages over different Named Data Networking communication models and prove that our chosen model lowers the number of messages at node level to minimum necessary, while maintaining or improving blockchain performance by leveraging the possibilities offered by an overlay such as specific communication mechanisms.

Submitted 24 January, 2023; originally announced January 2023.

Comments: 8 pages (arxiv); IEEE NOMS 2023 conference (4 pages)

arXiv:2206.10446 [pdf, other]

Deep dive into Interledger: Understanding the Interledger ecosystem

Authors: Lucian Trestioreanu, Cyril Cassagnes, Radu State

Abstract: At the technical level, the goal of Interledger is to provide an architecture and a minimal set of protocols to enable interoperability between any value transfer systems. The Interledger protocol is a protocol for inter-blockchain payments which can also accommodate FIAT currencies. To understand how it is possible to achieve this goal, several aspects of the technology require a deeper analysis. For this reason, in our journey to become knowledgeable and active contributors we decided to create our own test-bed on our premises. By doing so, we noticed that some aspects are well documented but we found that others might need more attention and clarification. Despite a large community effort, the task to keep information on a fast evolving software ecosystem up-to-date is tedious and not always the main priority for such a project. The purpose of this tutorial is to guide, through several examples and hands-on activities, community members who want to engage at different levels. The tutorial consolidates all the relevant information from generating a simple payment to ultimately creating a test-bed with the Interledger protocol suite between Ripple and other distributed ledger technologies.

Submitted 21 June, 2022; originally announced June 2022.

Comments: 65 pages, 28 figures, 4 tables

arXiv:2206.04185 [pdf, other]

doi 10.1145/3517745.3561448

A Flash(bot) in the Pan: Measuring Maximal Extractable Value in Private Pools

Authors: Ben Weintraub, Christof Ferreira Torres, Cristina Nita-Rotaru, Radu State

Abstract: The rise of Ethereum has led to a flourishing decentralized marketplace that has, unfortunately, fallen victim to frontrunning and Maximal Extractable Value (MEV) activities, where savvy participants game transaction orderings within a block for profit. One popular solution to address such behavior is Flashbots, a private pool with infrastructure and design goals aimed at eliminating the negative externalities associated with MEV. While Flashbots has established laudable goals to address MEV behavior, no evidence has been provided to show that these goals are achieved in practice. In this paper, we measure the popularity of Flashbots and evaluate if it is meeting its chartered goals. We find that (1) Flashbots miners account for over 99.9% of the hashing power in the Ethereum network, (2) powerful miners are making more than $2\times$ what they were making prior to using Flashbots, while non-miners' slice of the pie has shrunk commensurately, (3) mining is just as centralized as it was prior to Flashbots with more than 90% of Flashbots blocks coming from just two miners, and (4) while more than 80% of MEV extraction in Ethereum is happening through Flashbots, 13.2% is coming from other private pools.

Submitted 28 September, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

Comments: 14 pages, ACM IMC 2022

arXiv:2205.00869 [pdf, other]

doi 10.1145/3555776.3577611

Topology Analysis of the XRP Ledger

Authors: Vytautas Tumas, Sean Rivera, Damien Magoni, Radu State

Abstract: XRP Ledger is one of the oldest, well-established blockchains. Despite the popularity of the XRP Ledger, little is known about its underlying peer-to-peer network. The structural properties of a network impact its efficiency, security and robustness. We aim to close the knowledge gap by providing a detailed analysis of the XRP overlay network. In this paper we examine the graph-theoretic properties of the XRP Ledger peer-to-peer network and its temporal characteristics. We crawl the XRP Ledger over two months and collect 1,290 unique network snapshots. We uncover a small group of nodes that act as a networking backbone. In addition, we observe a high network churn, with a third of the nodes changing every five days. Our findings have strong implications for the resilience and safety of the XRP Ledger.

Submitted 10 January, 2023; v1 submitted 2 May, 2022; originally announced May 2022.

Comments: Extended edition, 8 pages. In The 38th ACM/SIGAPP Symposium on Applied Computing, March 27 - March 31, 2023, Tallinn, Estonia

arXiv:2110.09207 [pdf, other]

SPON: Enabling Resilient Inter-Ledgers Payments with an Intrusion-Tolerant Overlay

Authors: Lucian Trestioreanu, Cristina Nita-Rotaru, Aanchal Malhotra, Radu State

Abstract: Payment systems are a critical component of everyday life in our society. While in many situations payments are still slow, opaque, siloed, expensive or even fail, users expect them to be fast, transparent, cheap, reliable and global. Recent technologies such as distributed ledgers create opportunities for near-real-time, cheaper and more transparent payments. However, in order to achieve a global payment system, payments should be possible not only within one ledger, but also across different ledgers and geographies. In this paper we propose Secure Payments with Overlay Networks (SPON), a service that enables global payments across multiple ledgers by combining the transaction exchange provided by the Interledger protocol with an intrusion-tolerant overlay of relay nodes to achieve (1) improved payment latency, (2) fault tolerance to benign failures such as node failures and network partitions, and (3) resilience to BGP hijacking attacks. We discuss the design goals and present an implementation based on the Interledger protocol and Spines overlay network. We analyze the resilience of SPON and demonstrate through experimental evaluation that it is able to improve payment latency, recover from path outages, withstand network partition attacks, and disseminate payments fairly across multiple ledgers. We also show how SPON can be deployed to make the communication between different ledgers resilient to BGP hijacking attacks.

Submitted 3 November, 2021; v1 submitted 18 October, 2021; originally announced October 2021.

Comments: 9 pages, 14 figures, IEEE Conference on Communications and Network Security October 2021

arXiv:2108.10071 [pdf, other]

Elysium: Context-Aware Bytecode-Level Patching to Automatically Heal Vulnerable Smart Contracts

Authors: Christof Ferreira Torres, Hugo Jonker, Radu State

Abstract: Fixing bugs is easiest by patching source code. However, source code is not always available: only 0.3% of the ~49M smart contracts that are currently deployed on Ethereum have their source code publicly available. Moreover, since contracts may call functions from other contracts, security flaws in closed-source contracts may affect open-source contracts as well. However, current state-of-the-art approaches that operate on closed-source contracts (i.e., EVM bytecode), such as EVMPatch and SmartShield, make use of purely hard-coded templates that leverage fix patching patterns. As a result, they cannot dynamically adapt to the bytecode that is being patched, which severely limits their flexibility and scalability. For instance, when patching integer overflows using hard-coded templates, a particular patch template needs to be employed as the bounds to be checked are different for each integer size. In this paper, we propose Elysium, a scalable approach towards automatic smart contract repair at the bytecode level. Elysium combines template-based and semantic-based patching by inferring context information from bytecode. Elysium is currently able to patch 7 different types of vulnerabilities in smart contracts automatically and can easily be extended with new templates and new bug-finding tools. We evaluate its effectiveness and correctness using 3 different datasets by replaying more than 500K transactions on patched contracts. We find that Elysium outperforms existing tools by patching at least 30% more contracts correctly.
Finally, we also compare the overhead of Elysium in terms of deployment and transaction cost. In comparison to other tools, we find that generally Elysium minimizes the runtime cost (i.e., transaction cost) up to a factor of 1.7, for only a marginally higher deployment cost, where deployment cost is a one-time cost as compared to the runtime cost.

Submitted 4 July, 2022; v1 submitted 23 August, 2021; originally announced August 2021.

arXiv:2106.11036 [pdf, ps, other]

Know Your Model (KYM): Increasing Trust in AI and Machine Learning

Authors: Mary Roszel, Robert Norvill, Jean Hilger, Radu State

Abstract: The widespread utilization of AI systems has drawn attention to the potential impacts of such systems on society. Of particular concern are the consequences that prediction errors may have on real-world scenarios, and the trust humanity places in AI systems. It is necessary to understand how we can evaluate trustworthiness in AI and how individuals and entities alike can develop trustworthy AI systems. In this paper, we analyze each element of trustworthiness and provide a set of 20 guidelines that can be leveraged to ensure optimal AI functionality while taking into account the greater ethical, technical, and practical impacts to humanity. Moreover, the guidelines help ensure that trustworthiness is provable and can be demonstrated, they are implementation agnostic, and they can be applied to any AI system in any sector.

Submitted 31 May, 2021; originally announced June 2021.

Comments: 10 pages

arXiv:2102.03347 [pdf, other]

Frontrunner Jones and the Raiders of the Dark Forest: An Empirical Study of Frontrunning on the Ethereum Blockchain

Authors: Christof Ferreira Torres, Ramiro Camino, Radu State

Abstract: Ethereum prospered the inception of a plethora of smart contract applications, ranging from gambling games to decentralized finance. However, Ethereum is also considered a highly adversarial environment, where vulnerable smart contracts will eventually be exploited. Recently, Ethereum's pool of pending transactions has become a far more aggressive environment. In the hope of making some profit, attackers continuously monitor the transaction pool and try to frontrun their victims' transactions by either displacing or suppressing them, or strategically inserting their transactions. This paper aims to shed some light into what is known as a dark forest and uncover these predators' actions. We present a methodology to efficiently measure the three types of frontrunning: displacement, insertion, and suppression. We perform a large-scale analysis on more than 11M blocks and identify almost 200K attacks with an accumulated profit of 18.41M USD for the attackers, providing evidence that frontrunning is both, lucrative and a prevalent issue.

Submitted 3 June, 2021; v1 submitted 5 February, 2021; originally announced February 2021.

arXiv:2101.06204 [pdf, other]

The Eye of Horus: Spotting and Analyzing Attacks on Ethereum Smart Contracts

Authors: Christof Ferreira Torres, Antonio Ken Iannillo, Arthur Gervais, Radu State

Abstract: In recent years, Ethereum gained tremendously in popularity, growing from a daily transaction average of 10K in January 2016 to an average of 500K in January 2020. Similarly, smart contracts began to carry more value, making them appealing targets for attackers. As a result, they started to become victims of attacks, costing millions of dollars. In response to these attacks, both academia and industry proposed a plethora of tools to scan smart contracts for vulnerabilities before deploying them on the blockchain. However, most of these tools solely focus on detecting vulnerabilities and not attacks, let alone quantifying or tracing the number of stolen assets. In this paper, we present Horus, a framework that empowers the automated detection and investigation of smart contract attacks based on logic-driven and graph-driven analysis of transactions. Horus provides quick means to quantify and trace the flow of stolen assets across the Ethereum blockchain. We perform a large-scale analysis of all the smart contracts deployed on Ethereum until May 2020. We identified 1,888 attacked smart contracts and 8,095 adversarial transactions in the wild. Our investigation shows that the number of attacks did not necessarily decrease over the past few years, but for some vulnerabilities remained constant. Finally, we also demonstrate the practicality of our framework via an in-depth analysis on the recent Uniswap and Lendf.me attacks.

Submitted 15 January, 2021; originally announced January 2021.

arXiv:2005.12156 [pdf, other]

ConFuzzius: A Data Dependency-Aware Hybrid Fuzzer for Smart Contracts

Authors: Christof Ferreira Torres, Antonio Ken Iannillo, Arthur Gervais, Radu State

Abstract: Smart contracts are Turing-complete programs that are executed across a blockchain. Unlike traditional programs, once deployed, they cannot be modified. As smart contracts carry more value, they become more of an exciting target for attackers. Over the last years, they suffered from exploits costing millions of dollars due to simple programming mistakes. As a result, a variety of tools for detecting bugs have been proposed. Most of these tools rely on symbolic execution, which may yield false positives due to over-approximation. Recently, many fuzzers have been proposed to detect bugs in smart contracts. However, these tend to be more effective in finding shallow bugs and less effective in finding bugs that lie deep in the execution, therefore achieving low code coverage and many false negatives. An alternative that has proven to achieve good results in traditional programs is hybrid fuzzing, a combination of symbolic execution and fuzzing. In this work, we study hybrid fuzzing on smart contracts and present ConFuzzius, the first hybrid fuzzer for smart contracts. ConFuzzius uses evolutionary fuzzing to exercise shallow parts of a smart contract and constraint solving to generate inputs that satisfy complex conditions that prevent evolutionary fuzzing from exploring deeper parts. Moreover, ConFuzzius leverages dynamic data dependency analysis to efficiently generate sequences of transactions that are more likely to result in contract states in which bugs may be hidden.
We evaluate the effectiveness of ConFuzzius by comparing it with state-of-the-art symbolic execution tools and fuzzers for smart contracts. Our evaluation on a curated dataset of 128 contracts and 21K real-world contracts shows that our hybrid approach detects more bugs (up to 23%) while outperforming state-of-the-art in terms of code coverage (up to 69%), and that data dependency analysis boosts bug detection up to 18%.

Submitted 10 March, 2021; v1 submitted 25 May, 2020; originally announced May 2020.

arXiv:2005.03773 [pdf, other]

Minority Class Oversampling for Tabular Data with Deep Generative Models

Authors: Ramiro Camino, Christian Hammerschmidt, Radu State

Abstract: In practice, machine learning experts are often confronted with imbalanced data. Without accounting for the imbalance, common classifiers perform poorly and standard evaluation metrics mislead the practitioners on the model's performance. A common method to treat imbalanced datasets is under- and oversampling. In this process, samples are either removed from the majority class or synthetic samples are added to the minority class. In this paper, we follow up on recent developments in deep learning. We take proposals of deep generative models, including our own, and study the ability of these approaches to provide realistic samples that improve performance on imbalanced classification tasks via oversampling. Across 160K+ experiments, we show that all of the new methods tend to perform better than simple baseline methods such as SMOTE, but require different under- and oversampling ratios to do so. Our experiments show that the method of sampling does not affect quality, but runtime varies widely. We also observe that the improvements in terms of performance metric, while shown to be significant when ranking the methods, often are minor in absolute terms, especially compared to the required effort. Furthermore, we notice that a large part of the improvement is due to undersampling, not oversampling. We make our code and testing framework available.

Submitted 20 July, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

arXiv:2003.09241 [pdf]

Blockchain Governance: An Overview and Prediction of Optimal Strategies using Nash Equilibrium

Authors: Nida Khan, Tabrez Ahmad, Anass Patel, Radu State

Abstract: Blockchain governance is a subject of ongoing research and an interdisciplinary view of blockchain governance is vital to aid in further research for establishing a formal governance framework for this nascent technology. In this paper, the position of blockchain governance within the hierarchy of Institutional governance is discussed. Blockchain governance is analyzed from the perspective of IT governance using Nash equilibrium to predict the outcome of different governance decisions. A payoff matrix for blockchain governance is created and simulation of different strategy profiles is accomplished for computation of all Nash equilibria. The paper elaborates upon payoff matrices for different kinds of blockchain governance, which are used in the proposition of novel mathematical formulae usable to predict the best governance strategy that minimizes the occurrence of a hard fork as well as predicts the behavior of the majority during protocol updates. The paper also includes validation of the proposed formulae using real Ethereum data.

Submitted 20 March, 2020; originally announced March 2020.

Comments: Accepted for publication in AUEIRC-Springer 2020

arXiv:1910.01449 [pdf, ps, other]

A Data Science Approach for Honeypot Detection in Ethereum

Authors: Ramiro Camino, Christof Ferreira Torres, Mathis Baden, Radu State

Abstract: Ethereum smart contracts have recently drawn a considerable amount of attention from the media, the financial industry and academia. With the increase in popularity, malicious users found new opportunities to profit by deceiving newcomers. Consequently, attackers started luring other attackers into contracts that seem to have exploitable flaws, but that actually contain a complex hidden trap that in the end benefits the contract creator. In the blockchain community, these contracts are known as honeypots. A recent study presented a tool called HONEYBADGER that uses symbolic execution to detect honeypots by analyzing contract bytecode. In this paper, we present a data science detection approach based foremost on the contract transaction behavior. We create a partition of all the possible cases of fund movements between the contract creator, the contract, the transaction sender and other participants. To this end, we add transaction aggregated features, such as the number of transactions and the corresponding mean value and other contract features, for example compilation information and source code length. We find that all aforementioned categories of features contain useful information for the detection of honeypots. Moreover, our approach allows us to detect new, previously undetected honeypots of already known techniques. We furthermore employ our method to test the detection of unknown honeypot techniques by sequentially removing one technique from the training set. We show that our method is capable of discovering the removed honeypot techniques.
Finally, we discovered two new techniques that were previously not known.

Submitted 19 December, 2019; v1 submitted 3 October, 2019; originally announced October 2019.

arXiv:1908.09899 [pdf, other]

SynGAN: Towards Generating Synthetic Network Attacks using GANs

Authors: Jeremy Charlier, Aman Singh, Gaston Ormazabal, Radu State, Henning Schulzrinne

Abstract: The rapid digital transformation without security considerations has resulted in the rise of global-scale cyberattacks. The first line of defense against these attacks is Network Intrusion Detection Systems (NIDS). Once deployed, however, these systems work as black boxes with a high rate of false positives and no measurable effectiveness. There is a need to continuously test and improve these systems by emulating real-world network attack mutations. We present SynGAN, a framework that generates adversarial network attacks using Generative Adversarial Networks (GANs). SynGAN generates malicious packet flow mutations using real attack traffic, which can improve NIDS attack detection rates. As a first step, we compare two public datasets, NSL-KDD and CICIDS2017, for generating synthetic Distributed Denial of Service (DDoS) network attacks. We evaluate the attack quality (real vs. synthetic) using a gradient boosting classifier.

Submitted 26 August, 2019; originally announced August 2019.

arXiv:1905.13020 [pdf, other]

Visualization of AE's Training on Credit Card Transactions with Persistent Homology

Authors: Jeremy Charlier, Francois Petit, Gaston Ormazabal, Radu State, Jean Hilger

Abstract: Auto-encoders are among the most popular neural network architectures for dimension reduction. They are composed of two parts: the encoder which maps the model distribution to a latent manifold and the decoder which maps the latent manifold to a reconstructed distribution. However, auto-encoders are known to provoke chaotically scattered data distribution in the latent manifold resulting in an incomplete reconstructed distribution. Current distance measures fail to detect this problem because they are not able to acknowledge the shape of the data manifolds, i.e. their topological features, and the scale at which the manifolds should be analyzed. We propose Persistent Homology for Wasserstein Auto-Encoders, called PHom-WAE, a new methodology to assess and measure the data distribution of a generative model. PHom-WAE minimizes the Wasserstein distance between the true distribution and the reconstructed distribution and uses persistent homology, the study of the topological features of a space at different spatial resolutions, to compare the nature of the latent manifold and the reconstructed distribution. Our experiments underline the potential of persistent homology for Wasserstein Auto-Encoders in comparison to Variational Auto-Encoders, another type of generative model. The experiments are conducted on a real-world data set particularly challenging for traditional distance measures and auto-encoders.
PHom-WAE is the first methodology to propose a topological distance measure, the bottleneck distance, for Wasserstein Auto-Encoders used to compare decoded samples of high quality in the context of credit card transactions.

Submitted 12 August, 2019; v1 submitted 24 May, 2019; originally announced May 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1905.09894

arXiv:1905.12568 [pdf, other]

Predicting Sparse Clients' Actions with CPOPT-Net in the Banking Environment

Authors: Jeremy Charlier, Radu State, Jean Hilger

Abstract: The digital revolution of the banking system, together with evolving European regulations, has pushed the major banking actors to innovate through new uses of their clients' digital information. Given highly sparse client activities, we propose CPOPT-Net, an algorithm that combines the CP canonical tensor decomposition, a multidimensional matrix decomposition that factorizes a tensor as the sum of rank-one tensors, and neural networks. CPOPT-Net efficiently removes sparse information with a gradient-based resolution while relying on neural networks for time series predictions. Our experiments show that CPOPT-Net is capable of performing accurate predictions of the clients' actions in the context of personalized recommendation. CPOPT-Net is the first algorithm to use non-linear conjugate gradient tensor resolution with neural networks to propose predictions of financial activities on a public data set.

Submitted 23 May, 2019; originally announced May 2019.

arXiv:1905.12567 [pdf, other]

MQLV: Optimal Policy of Money Management in Retail Banking with Q-Learning

Authors: Jeremy Charlier, Gaston Ormazabal, Radu State, Jean Hilger

Abstract: Reinforcement learning has become one of the best approaches to train a computer game emulator capable of human level performance. In a reinforcement learning approach, an optimal value function is learned across a set of actions, or decisions, that leads to a set of states giving different rewards, with the objective to maximize the overall reward. A policy assigns to each state-action pair an expected return. We call an optimal policy a policy for which the value function is optimal. QLBS, Q-Learner in the Black-Scholes(-Merton) Worlds, applies the reinforcement learning concepts, and notably the popular Q-learning algorithm, to the financial stochastic model of Black, Scholes and Merton. It is, however, specifically optimized for the geometric Brownian motion and the vanilla options. Its range of application is, therefore, limited to vanilla option pricing within financial markets. We propose MQLV, Modified Q-Learner for the Vasicek model, a new reinforcement learning approach that determines the optimal policy of money management based on the aggregated financial transactions of the clients. It unlocks new frontiers to establish personalized credit card limits or to fulfill bank loan applications, targeting the retail banking industry. MQLV extends the simulation to mean reverting stochastic diffusion processes and it uses a digital function, a Heaviside step function expressed in its discrete form, to estimate the probability of a future event such as a payment default.
In our experiments, we first show the similarities between a set of historical financial transactions and Vasicek generated transactions and, then, we underline the potential of MQLV on generated Monte Carlo simulations. Finally, MQLV is the first Q-learning Vasicek-based methodology addressing transparent decision making processes in retail banking.

Submitted 21 August, 2019; v1 submitted 24 May, 2019; originally announced May 2019.

arXiv:1905.10363 [pdf, other]

User-Device Authentication in Mobile Banking using APHEN for Paratuck2 Tensor Decomposition

Authors: Jeremy Charlier, Eric Falk, Radu State, Jean Hilger

Abstract: The new European financial regulations such as PSD2 are changing the retail banking services. Notably, the monitoring of personal expenses is now open to institutions other than retail banks. Nonetheless, the retail banks are looking to leverage the user-device authentication on the mobile banking applications to enhance personal financial advertisement. To address the profiling of the authentication, we rely on tensor decomposition, a higher dimensional analogue of matrix decomposition. We use Paratuck2, which expresses a tensor as a multiplication of matrices and diagonal tensors, because of the imbalance between the number of users and devices. We highlight why Paratuck2 is more appropriate in this case than the popular CP tensor decomposition, which decomposes a tensor as a sum of rank-one tensors. However, the computation of Paratuck2 is computationally intensive. We propose a new APproximate HEssian-based Newton resolution algorithm, APHEN, capable of solving Paratuck2 more accurately and faster than the other popular approaches based on alternating least squares or gradient descent. The results of Paratuck2 are used for the predictions of users' authentication with neural networks. We apply our method for the concrete case of targeting clients for financial advertising campaigns based on the authentication events generated by mobile banking applications.

Submitted 23 May, 2019; originally announced May 2019.

arXiv:1905.09894 [pdf, other]

PHom-GeM: Persistent Homology for Generative Models

Authors: Jeremy Charlier, Radu State, Jean Hilger

Abstract: Generative neural network models, including Generative Adversarial Network (GAN) and Auto-Encoders (AE), are among the most popular neural network models to generate adversarial data. The GAN model is composed of a generator that produces synthetic data and of a discriminator that discriminates between the generator's output and the true data. AE consist of an encoder which maps the model distribution to a latent manifold and of a decoder which maps the latent manifold to a reconstructed distribution. However, generative models are known to provoke chaotically scattered reconstructed distribution during their training, and consequently, incomplete generated adversarial distributions. Current distance measures fail to address this problem because they are not able to acknowledge the shape of the data manifold, i.e. its topological features, and the scale at which the manifold should be analyzed. We propose Persistent Homology for Generative Models, PHom-GeM, a new methodology to assess and measure the distribution of a generative model. PHom-GeM minimizes an objective function between the true and the reconstructed distributions and uses persistent homology, the study of the topological features of a space at different spatial resolutions, to compare the nature of the true and the generated distributions. Our experiments underline the potential of persistent homology for Wasserstein GAN in comparison to Wasserstein AE and Variational AE.
The experiments are conducted on a real-world data set particularly challenging for traditional distance measures and generative neural network models. PHom-GeM is the first methodology to propose a topological distance measure, the bottleneck distance, for generative models used to compare adversarial samples in the context of credit card transactions.

Submitted 23 May, 2019; originally announced May 2019.

arXiv:1905.09869 [pdf, other]

Non-Negative PARATUCK2 Tensor Decomposition Combined to LSTM Network For Smart Contracts Profiling

Authors: Jeremy Charlier, Radu State, Jean Hilger

Abstract: Smart contracts are programs stored and executed on a blockchain. The Ethereum platform, an open-source blockchain-based platform, has been designed to use these programs offering secured protocols and transaction costs reduction. The Ethereum Virtual Machine performs smart contracts runs, where the execution of each contract is limited to the amount of gas required to execute the operations described in the code. Each gas unit must be paid using Ether, the crypto-currency of the platform. Due to smart contracts interactions evolving over time, analyzing the behavior of smart contracts is very challenging. We address this challenge in our paper. We develop for this purpose an innovative approach based on the non-negative tensor decomposition PARATUCK2 combined with long short-term memory (LSTM) to assess if predictive analysis can forecast smart contracts interactions over time. To validate our methodology, we report results for two use cases. The main use case is related to analyzing smart contracts and allows shedding some light into the complex interactions among smart contracts. In order to show the generality of our method on other use cases, we also report its performance on video on demand recommendation.

Submitted 23 May, 2019; originally announced May 2019.

arXiv:1902.11212 [pdf, other]

Infer Your Enemies and Know Yourself, Learning in Real-Time Bidding with Partially Observable Opponents

Authors: Manxing Du, Alexander I. Cowen-Rivers, Ying Wen, Phu Sakulwongtana, Jun Wang, Mats Brorsson, Radu State

Abstract: Real-time bidding, as one of the most popular mechanisms for selling online ad slots, facilitates advertisers to reach their potential customers. The goal of bidding optimization is to maximize the advertisers' return on investment (ROI) under a certain budget setting. A straightforward solution is to model the bidding function in an explicit form. However, the static functional solutions lack generality in practice and are insensitive to the stochastic behaviour of other bidders in the environment. In this paper, we propose a general multi-agent framework with actor-critic solutions facing against playing imperfect information games. We firstly introduce a novel Deep Attentive Survival Analysis (DASA) model to infer the censored data in the second price auctions which outperforms state-of-the-art survival analysis. Furthermore, our approach introduces the DASA model as the opponent model into the policy learning process for each agent and develops a mean field equilibrium analysis of the second price auctions. The experiments have shown that with the inference of the market, the market converges to the equilibrium much faster while playing against both fixed strategy agents and dynamic learning agents.

Submitted 28 February, 2019; originally announced February 2019.

arXiv:1902.10666 [pdf, other]

Improving Missing Data Imputation with Deep Generative Models

Authors: Ramiro D. Camino, Christian A. Hammerschmidt, Radu State

Abstract: Datasets with missing values are very common in industry applications, and they can have a negative impact on machine learning models. Recent studies introduced solutions to the problem of imputing missing values based on deep generative models. Previous experiments with Generative Adversarial Networks and Variational Autoencoders showed interesting results in this domain, but it is not clear which method is preferable for different use cases. The goal of this work is twofold: we present a comparison between missing data imputation solutions based on deep generative models, and we propose improvements over those methodologies. We run our experiments using known real life datasets with different characteristics, removing values at random and reconstructing them with several imputation techniques. Our results show that the presence or absence of categorical variables can alter the selection of the best model, and that some models are more stable than others after similar runs with different random number generator seeds.

Submitted 27 February, 2019; originally announced February 2019.

arXiv:1902.06976 [pdf, other]

The Art of The Scam: Demystifying Honeypots in Ethereum Smart Contracts

Authors: Christof Ferreira Torres, Mathis Steichen, Radu State

Abstract: Modern blockchains, such as Ethereum, enable the execution of so-called smart contracts - programs that are executed across a decentralised network of nodes. As smart contracts become more popular and carry more value, they become more of an interesting target for attackers. In the past few years, several smart contracts have been exploited by attackers. However, a new trend towards a more proactive approach seems to be on the rise, where attackers do not search for vulnerable contracts anymore. Instead, they try to lure their victims into traps by deploying seemingly vulnerable contracts that contain hidden traps. This new type of contracts is commonly referred to as honeypots. In this paper, we present the first systematic analysis of honeypot smart contracts, by investigating their prevalence, behaviour and impact on the Ethereum blockchain. We develop a taxonomy of honeypot techniques and use this to build HoneyBadger - a tool that employs symbolic execution and well defined heuristics to expose honeypots. We perform a large-scale analysis on more than 2 million smart contracts and show that our tool not only achieves high precision, but is also highly efficient. We identify 690 honeypot smart contracts as well as 240 victims in the wild, with an accumulated profit of more than $90,000 for the honeypot creators. Our manual validation shows that 87% of the reported contracts are indeed honeypots.

Submitted 29 May, 2019; v1 submitted 19 February, 2019; originally announced February 2019.

arXiv:1807.01202 [pdf, other]

Generating Multi-Categorical Samples with Generative Adversarial Networks

Authors: Ramiro Camino, Christian Hammerschmidt, Radu State

Abstract: We propose a method to train generative adversarial networks on multivariate feature vectors representing multiple categorical values. In contrast to the continuous domain, where GAN-based methods have delivered considerable results, GANs struggle to perform equally well on discrete data. We propose and compare several architectures based on multiple (Gumbel) softmax output layers taking into account the structure of the data. We evaluate the performance of our architecture on datasets with different sparsity, number of features, ranges of categorical values, and dependencies among the features. Our proposed architecture and method outperform existing models.

Submitted 4 July, 2018; v1 submitted 3 July, 2018; originally announced July 2018.

Journal ref: Presented at the ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models, Stockholm, Sweden

arXiv:1803.00897 [pdf, other]

Impact of Biases in Big Data

Authors: Patrick Glauner, Petko Valtchev, Radu State

Abstract: The underlying paradigm of big data-driven machine learning reflects the desire of deriving better conclusions from simply analyzing more data, without the necessity of looking at theory and models. Is having simply more data always helpful? In 1936, The Literary Digest collected 2.3M filled in questionnaires to predict the outcome of that year's US presidential election. The outcome of this big data prediction proved to be entirely wrong, whereas George Gallup only needed 3K handpicked people to make an accurate prediction. Generally, biases occur in machine learning whenever the distributions of training set and test set are different. In this work, we provide a review of different sorts of biases in (big) data sets in machine learning. We provide definitions and discussions of the most commonly appearing biases in machine learning: class imbalance and covariate shift. We also show how these biases can be quantified and corrected. This work is an introductory text for both researchers and practitioners to become more aware of this topic and thus to derive more reliable models for their learning problems.

Submitted 2 March, 2018; originally announced March 2018.

Journal ref: Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2018)

arXiv:1801.05627 [pdf, ps, other]

On the Reduction of Biases in Big Data Sets for the Detection of Irregular Power Usage

Authors: Patrick Glauner, Radu State, Petko Valtchev, Diogo Duarte

Abstract: In machine learning, a bias occurs whenever training sets are not representative for the test data, which results in unreliable models. The most common biases in data are arguably class imbalance and covariate shift. In this work, we aim to shed light on this topic in order to increase the overall attention to this issue in the field of machine learning. We propose a scalable novel framework for reducing multiple biases in high-dimensional data sets in order to train more reliable predictors. We apply our methodology to the detection of irregular power usage from real, noisy industrial data. In emerging markets, irregular power usage, and electricity theft in particular, may range up to 40% of the total electricity distributed. Biased data sets are of particular issue in this domain. We show that reducing these biases increases the accuracy of the trained predictors. Our models have the potential to generate significant economic value in a real world application, as they are being deployed in a commercial software for the detection of irregular power usage.

Submitted 3 April, 2018; v1 submitted 17 January, 2018; originally announced January 2018.

Journal ref: Proceedings of the 13th International FLINS Conference on Data Science and Knowledge Engineering for Sensing Decision Support (FLINS 2018)

arXiv:1709.03008 [pdf, other]

Identifying Irregular Power Usage by Turning Predictions into Holographic Spatial Visualizations

Authors: Patrick Glauner, Niklas Dahringer, Oleksandr Puhachov, Jorge Augusto Meira, Petko Valtchev, Radu State, Diogo Duarte

Abstract: Power grids are critical infrastructure assets that face non-technical losses (NTL) such as electricity theft or faulty meters. NTL may range up to 40% of the total electricity distributed in emerging countries. Industrial NTL detection systems are still largely based on expert knowledge when deciding whether to carry out costly on-site inspections of customers. Electricity providers are reluctant to move to large-scale deployments of automated systems that learn NTL profiles from data due to the latter's propensity to suggest a large number of unnecessary inspections. In this paper, we propose a novel system that combines automated statistical decision making with expert knowledge. First, we propose a machine learning framework that classifies customers into NTL or non-NTL using a variety of features derived from the customers' consumption data. The methodology used is specifically tailored to the level of noise in the data. Second, in order to allow human experts to feed their knowledge in the decision loop, we propose a method for visualizing prediction results at various granularity levels in a spatial hologram. Our approach allows domain experts to put the classification results into the context of the data and to incorporate their knowledge for making the final decisions of which customers to inspect. This work has resulted in appreciable results on a real-world data set of 3.6M customers. Our system is being deployed in a commercial NTL detection software.

Submitted 9 September, 2017; originally announced September 2017.

Comments: Proceedings of the 17th IEEE International Conference on Data Mining Workshops (ICDMW 2017)

arXiv:1707.09430 [pdf, ps, other]

Human in the Loop: Interactive Passive Automata Learning via Evidence-Driven State-Merging Algorithms

Authors: Christian A. Hammerschmidt, Radu State, Sicco Verwer

Abstract: We present an interactive version of an evidence-driven state-merging (EDSM) algorithm for learning variants of finite state automata. Learning these automata often amounts to recovering or reverse engineering the model generating the data despite noisy, incomplete, or imperfectly sampled data sources rather than optimizing a purely numeric target function. Domain expertise and human knowledge about the target domain can guide this process, and typically is captured in parameter settings. Often, domain expertise is subconscious and not expressed explicitly. Directly interacting with the learning algorithm makes it easier to utilize this knowledge effectively.

Submitted 28 July, 2017; originally announced July 2017.

Comments: 4 pages, presented at the Human in the Loop workshop at ICML 2017

arXiv:1703.10121 [pdf, ps, other]

The Top 10 Topics in Machine Learning Revisited: A Quantitative Meta-Study

Authors: Patrick Glauner, Manxing Du, Victor Paraschiv, Andrey Boytsov, Isabel Lopez Andrade, Jorge Meira, Petko Valtchev, Radu State

Abstract: Which topics of machine learning are most commonly addressed in research? This question was initially answered in 2007 by doing a qualitative survey among distinguished researchers. In our study, we revisit this question from a quantitative perspective. Concretely, we collect 54K abstracts of papers published between 2007 and 2016 in leading machine learning journals and conferences. We then use machine learning in order to determine the top 10 topics in machine learning. We not only include models, but provide a holistic view across optimization, data, features, etc. This quantitative approach allows reducing the bias of surveys. It reveals new and up-to-date insights into what the 10 most prolific topics in machine learning research are. This allows researchers to identify popular topics as well as new and rising topics for their research.

Submitted 29 March, 2017; originally announced March 2017.

Journal ref: Proceedings of the 25th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2017)

arXiv:1702.03767 [pdf, other]

Is Big Data Sufficient for a Reliable Detection of Non-Technical Losses?

Authors: Patrick Glauner, Angelo Migliosi, Jorge Meira, Petko Valtchev, Radu State, Franck Bettinger

Abstract: Non-technical losses (NTL) occur during the distribution of electricity in power grids and include, but are not limited to, electricity theft and faulty meters. In emerging countries, they may range up to 40% of the total electricity distributed. In order to detect NTLs, machine learning methods are used that learn irregular consumption patterns from customer data and inspection results. The Big Data paradigm followed in modern machine learning reflects the desire of deriving better conclusions from simply analyzing more data, without the necessity of looking at theory and models. However, the sample of inspected customers may be biased, i.e. it does not represent the population of all customers. As a consequence, machine learning models trained on these inspection results are biased as well and therefore lead to unreliable predictions of whether customers cause NTL or not. In machine learning, this issue is called covariate shift and has not been addressed in the literature on NTL detection yet. In this work, we present a novel framework for quantifying and visualizing covariate shift. We apply it to a commercial data set from Brazil that consists of 3.6M customers and 820K inspection results. We show that some features have a stronger covariate shift than others, making predictions less reliable. In particular, previous inspections were focused on certain neighborhoods or customer classes and were not sufficiently spread among the population of customers. This framework is about to be deployed in a commercial product for NTL detection.

Submitted 25 July, 2017; v1 submitted 13 February, 2017; originally announced February 2017.

Comments: Proceedings of the 19th International Conference on Intelligent System Applications to Power Systems (ISAP 2017)

arXiv:1611.07100 [pdf, other]

Interpreting Finite Automata for Sequential Data

Authors: Christian Albert Hammerschmidt, Sicco Verwer, Qin Lin, Radu State

Abstract: Automaton models are often seen as interpretable models. Interpretability itself is not well defined: it remains unclear what interpretability means without first explicitly specifying objectives or desired attributes. In this paper, we identify the key properties used to interpret automata and propose a modification of a state-merging approach to learn variants of finite state automata. We apply the approach to problems beyond typical grammar inference tasks. Additionally, we cover several use-cases for prediction, classification, and clustering on sequential data in both supervised and unsupervised scenarios to show how the identified key properties are applicable in a wide range of contexts.

Submitted 24 November, 2016; v1 submitted 21 November, 2016; originally announced November 2016.

Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems

ACM Class: I.2.6

arXiv:1607.00872 [pdf, other]

Neighborhood Features Help Detecting Non-Technical Losses in Big Data Sets

Authors: Patrick Glauner, Jorge Meira, Lautaro Dolberg, Radu State, Franck Bettinger, Yves Rangoni, Diogo Duarte

Abstract: Electricity theft is a major problem around the world in both developed and developing countries and may range up to 40% of the total electricity distributed. More generally, electricity theft belongs to non-technical losses (NTL), which are losses that occur during the distribution of electricity in power grids. In this paper, we build features from the neighborhood of customers. We first split the area in which the customers are located into grids of different sizes. For each grid cell we then compute the proportion of inspected customers and the proportion of NTL found among the inspected customers. We then analyze the distributions of features generated and show why they are useful to predict NTL. In addition, we compute features from the consumption time series of customers. We also use master data features of customers, such as their customer class and voltage of their connection. We compute these features for a Big Data base of 31M meter readings, 700K customers and 400K inspection results. We then use these features to train four machine learning algorithms that are particularly suitable for Big Data sets because of their parallelizable structure: logistic regression, k-nearest neighbors, linear support vector machine and random forest. Using the neighborhood features instead of only analyzing the time series has resulted in appreciable results for Big Data sets for varying NTL proportions of 1%-90%. This work can therefore be deployed to a wide range of different regions around the world.

Submitted 25 July, 2017; v1 submitted 4 July, 2016; originally announced July 2016.

Comments: Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing Applications and Technologies (BDCAT 2016)

arXiv:1606.00626 [pdf, other]

doi 10.2991/ijcis.2017.10.1.51

The Challenge of Non-Technical Loss Detection using Artificial Intelligence: A Survey

Authors: Patrick Glauner, Jorge Augusto Meira, Petko Valtchev, Radu State, Franck Bettinger

Abstract: Detection of non-technical losses (NTL), which include electricity theft, faulty meters or billing errors, has attracted increasing attention from researchers in electrical engineering and computer science. NTLs cause significant harm to the economy, as in some countries they may range up to 40% of the total electricity distributed. The predominant research direction is employing artificial intelligence to predict whether a customer causes NTL. This paper first provides an overview of how NTLs are defined and their impact on economies, which includes loss of revenue and profit of electricity providers and decrease of the stability and reliability of electrical power grids. It then surveys the state-of-the-art research efforts in an up-to-date and comprehensive review of algorithms, features and data sets used. It finally identifies the key scientific and engineering challenges in NTL detection and suggests how they could be addressed in the future.

Submitted 25 July, 2017; v1 submitted 2 June, 2016; originally announced June 2016.

Journal ref: International Journal of Computational Intelligence Systems (IJCIS), vol. 10, issue 1, pp. 760-775, 2017

arXiv:1602.08350 [pdf, ps, other]

Large-Scale Detection of Non-Technical Losses in Imbalanced Data Sets

Authors: Patrick O. Glauner, Andre Boechat, Lautaro Dolberg, Radu State, Franck Bettinger, Yves Rangoni, Diogo Duarte

Abstract: Non-technical losses (NTL) such as electricity theft cause significant harm to our economies, as in some countries they may range up to 40% of the total electricity distributed. Detecting NTLs requires costly on-site inspections. Accurate prediction of NTLs for customers using machine learning is therefore crucial. To date, related research largely ignores that the two classes of regular and non-regular customers are highly imbalanced and that NTL proportions may change, and mostly considers small data sets, often not allowing the results to be deployed in production. In this paper, we present a comprehensive approach to assess three NTL detection models for different NTL proportions in large real world data sets of 100Ks of customers: Boolean rules, fuzzy logic and Support Vector Machine. This work has resulted in appreciable results that are about to be deployed in a leading industry solution. We believe that the considerations and observations made in this contribution are necessary for future smart meter research in order to report their effectiveness on imbalanced and large real world data sets.

Submitted 25 July, 2017; v1 submitted 26 February, 2016; originally announced February 2016.

Comments: Proceedings of the Seventh IEEE Conference on Innovative Smart Grid Technologies (ISGT 2016)

arXiv:1208.2877 [pdf, other]

Torinj : Automated Exploitation Malware Targeting Tor Users

Authors: Gerard Wagener, Alexandre Dulaunoy, Radu State

Abstract: We propose in this paper a new propagation vector for malicious software by abusing the Tor network. Tor is particularly relevant, since operating a Tor exit node is easy and involves low costs compared to attacking institutional or ISP networks. After presenting the Tor network from an attacker perspective, we describe an automated exploitation malware which is operated on a Tor exit node and aims to infect web browsers. Our experiments show that the currently deployed Tor network provides a large number of potential victims.

Submitted 14 August, 2012; originally announced August 2012.

arXiv:cs/0610109 [pdf, ps, other]

Intrusion detection mechanisms for VoIP applications

Authors: Mohamed El Baker Nassar, Radu State, Olivier Festor

Abstract: VoIP applications are emerging today as an important component in business and communication industry. In this paper, we address the intrusion detection and prevention in VoIP networks and describe how a conceptual solution based on the Bayes inference approach can be used to reinforce the existent security mechanisms. Our approach is based on network monitoring and analyzing of the VoIP-specific traffic. We give a detailed example on attack detection using the SIP signaling protocol.

Submitted 18 October, 2006; originally announced October 2006.

Journal ref: In Third annual VoIP security workshop (VSW'06) (2006)

Showing 1–42 of 42 results for author: State, R