Search | arXiv e-print repository

doi 10.1016/j.iswa.2024.200354

Tree Boosting Methods for Balanced and Imbalanced Classification and their Robustness Over Time in Risk Assessment

Authors: Gissel Velarde, Michael Weichert, Anuj Deshmunkh, Sanjay Deshmane, Anindya Sudhir, Khushboo Sharma, Vaibhav Joshi

Abstract: Most real-world classification problems deal with imbalanced datasets, posing a challenge for Artificial Intelligence (AI), i.e., machine learning algorithms, because the minority class, which is of extreme interest, often proves difficult to detect. This paper empirically evaluates tree boosting methods' performance given different dataset sizes and class distributions, from perfectly balanced to highly imbalanced. For tabular data, tree-based methods such as XGBoost stand out in several benchmarks due to detection performance and speed. Therefore, XGBoost and Imbalance-XGBoost are evaluated. After introducing the motivation to address risk assessment with machine learning, the paper reviews evaluation metrics for detection systems or binary classifiers. It proposes a method for data preparation followed by tree boosting methods including hyper-parameter optimization. The method is evaluated on private datasets of 1 thousand (K), 10K and 100K samples on distributions with 50, 45, 25, and 5 percent positive samples. As expected, the developed method increases its recognition performance as more data is given for training and the F1 score decreases as the data distribution becomes more imbalanced, but it is still significantly superior to the baseline of precision-recall determined by the ratio of positives divided by positives and negatives. Sampling to balance the training set does not provide consistent improvement and deteriorates detection. In contrast, classifier hyper-parameter optimization improves recognition, but should be applied carefully depending on data volume and distribution. Finally, the developed method is robust to data variation over time up to some point. Retraining can be used when performance starts deteriorating.
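
The baseline named in this abstract, the ratio of positives to all samples, can be made concrete with a short sketch. The Python below is illustrative only and is not taken from the paper: the function names are hypothetical, the 10,000-sample size is an arbitrary example, and the always-positive F1 is an added assumption about what a trivial detector would score.

```python
def baseline_precision(positives: int, negatives: int) -> float:
    """Precision of a no-skill detector: the share of positive samples."""
    return positives / (positives + negatives)


def always_positive_f1(positives: int, negatives: int) -> float:
    """F1 of a detector that flags every sample (recall = 1, precision = prevalence).

    Assumption for illustration; the paper may define its baseline differently.
    """
    precision = baseline_precision(positives, negatives)
    recall = 1.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    total = 10_000  # illustrative dataset size, not one of the paper's private datasets
    for percent_positive in (50, 45, 25, 5):
        positives = total * percent_positive // 100
        negatives = total - positives
        print(
            percent_positive,
            round(baseline_precision(positives, negatives), 3),
            round(always_positive_f1(positives, negatives), 3),
        )
```

Under these assumptions the baseline shrinks with the share of positives, which is why a fixed F1 score means more on the 5 percent distribution than on the balanced one.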

Submitted 25 April, 2025; originally announced April 2025.

Comments: 14 pages. arXiv admin note: text overlap with arXiv:2303.15218

Journal ref: Intelligent Systems with Applications 22 (2024) 200354

arXiv:2504.03731

A Benchmark for Scalable Oversight Protocols

Authors: Abhimanyu Pallavi Sudhir, Jackson Kaunismaa, Arjun Panickssery

Abstract: As AI agents surpass human capabilities, scalable oversight -- the problem of effectively supplying human feedback to potentially superhuman AI models -- becomes increasingly critical to ensure alignment. While numerous scalable oversight protocols have been proposed, they lack a systematic empirical framework to evaluate and compare them. While recent works have tried to empirically study scalable oversight protocols -- particularly Debate -- we argue that the experiments they conduct are not generalizable to other protocols. We introduce the scalable oversight benchmark, a principled framework for evaluating human feedback mechanisms based on our agent score difference (ASD) metric, a measure of how effectively a mechanism advantages truth-telling over deception. We supply a Python package to facilitate rapid and competitive evaluation of scalable oversight protocols on our benchmark, and conduct a demonstrative experiment benchmarking Debate.

Submitted 31 March, 2025; originally announced April 2025.

Comments: Accepted at the ICLR 2025 Workshop on Bidirectional Human-AI Alignment (BiAlign)

arXiv:2503.05828

Market-based Architectures in RL and Beyond

Authors: Abhimanyu Pallavi Sudhir, Long Tran-Thanh

Abstract: Market-based agents refer to reinforcement learning agents which determine their actions based on an internal market of sub-agents. We introduce a new type of market-based algorithm where the state itself is factored into several axes called "goods", which allows for greater specialization and parallelism than existing market-based RL algorithms. Furthermore, we argue that market-based algorithms have the potential to address many current challenges in AI, such as search, dynamic scaling and complete feedback, and demonstrate that they may be seen to generalize neural networks; finally, we list some novel ways that market algorithms may be applied in conjunction with Large Language Models for immediate practical applicability.

Submitted 5 March, 2025; originally announced March 2025.

Comments: Accepted at AAMAS 2025

arXiv:2412.18544

Consistency Checks for Language Model Forecasters

Authors: Daniel Paleka, Abhimanyu Pallavi Sudhir, Alejandro Alvarez, Vineeth Bhat, Adam Shen, Evan Wang, Florian Tramèr

Abstract: Forecasting is a task that is difficult to evaluate: the ground truth can only be known in the future. Recent work showing LLM forecasters rapidly approaching human-level performance begs the question: how can we benchmark and evaluate these forecasters instantaneously? Following the consistency check framework, we measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We propose a new, general consistency metric based on arbitrage: for example, if a forecasting AI illogically predicts that both the Democratic and Republican parties have 60% probability of winning the 2024 US presidential election, an arbitrageur can trade against the forecaster's predictions and make a profit. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits the predictions of the forecaster, and measures the consistency of the predictions. We then build a standard, proper-scoring-rule forecasting benchmark, and show that our (instantaneous) consistency metrics correlate with LLM forecasters' ground truth Brier scores (which are only known in the future). We also release a consistency benchmark that resolves in 2028, providing a long-term evaluation tool for forecasting.
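
The 60%/60% example in this abstract can be worked through numerically. The sketch below is only an illustration of that single example and is not the paper's consistency metric, whose exact form the abstract does not give. It assumes the two outcomes are mutually exclusive, that the arbitrageur can sell one contract paying 1 unit on each outcome at the forecaster's stated probability, and all function names are hypothetical.

```python
from typing import Mapping


def sell_all_profit(prices: Mapping[str, float], winner: str) -> float:
    """Profit from selling one unit-payout contract per outcome at the forecaster's price.

    The seller collects every price up front and pays 1 unit only on the
    contract whose outcome occurs (nothing if no listed outcome occurs).
    """
    income = sum(prices.values())
    payout = 1.0 if winner in prices else 0.0
    return income - payout


def worst_case_profit(prices: Mapping[str, float]) -> float:
    """Smallest profit over all outcomes, assuming mutually exclusive events."""
    return sum(prices.values()) - 1.0


if __name__ == "__main__":
    forecast = {"Democratic": 0.6, "Republican": 0.6}
    for winner in ("Democratic", "Republican", "Other"):
        print(winner, round(sell_all_profit(forecast, winner), 2))
    print("worst case:", round(worst_case_profit(forecast), 2))
```

Under these assumptions the seller collects 1.2 units and pays out at most 1, so the profit is positive whichever party wins. A forecast whose probabilities sum to at most 1 leaves no such guaranteed gain from selling.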

Submitted 9 January, 2025; v1 submitted 24 December, 2024; originally announced December 2024.

Comments: 55 pages, 25 figures. Submitted to ICLR 2025

arXiv:2402.14021

Betting on what is neither verifiable nor falsifiable

Authors: Abhimanyu Pallavi Sudhir, Long Tran-Thanh

Abstract: Prediction markets are useful for estimating probabilities of claims whose truth will be revealed at some fixed time -- this includes questions about the values of real-world events (i.e. statistical uncertainty), and questions about the values of primitive recursive functions (i.e. logical or algorithmic uncertainty). However, they cannot be directly applied to questions without a fixed resolution criterion, and real-world applications of prediction markets to such questions often amount to predicting not whether a sentence is true, but whether it will be proven. Such questions could be represented by countable unions or intersections of more basic events, or as First-Order-Logic sentences on the Arithmetical Hierarchy (or even beyond FOL, as hyperarithmetical sentences). In this paper, we propose an approach to betting on such events via options, or equivalently as bets on the outcome of a "verification-falsification game". Our work thus acts as an alternative to the existing framework of Garrabrant induction for logical uncertainty, and relates to the stance known as constructivism in the philosophy of mathematics; furthermore it has broader implications for philosophy and mathematical logic.

Submitted 29 January, 2024; originally announced February 2024.

Comments: 15 pages, 4 figures

MSC Class: 91B26 (Primary); 03F03 (Secondary) ACM Class: F.4.1; I.2.11

arXiv:2303.15218

Evaluating XGBoost for Balanced and Imbalanced Data: Application to Fraud Detection

Authors: Gissel Velarde, Anindya Sudhir, Sanjay Deshmane, Anuj Deshmunkh, Khushboo Sharma, Vaibhav Joshi

Abstract: This paper evaluates XGBoost's performance given different dataset sizes and class distributions, from perfectly balanced to highly imbalanced. XGBoost has been selected for evaluation, as it stands out in several benchmarks due to its detection performance and speed. After introducing the problem of fraud detection, the paper reviews evaluation metrics for detection systems or binary classifiers, and illustrates with examples how different metrics work for balanced and imbalanced datasets. Then, it examines the principles of XGBoost. It proposes a pipeline for data preparation and compares a Vanilla XGBoost against a random search-tuned XGBoost. Random search fine-tuning provides consistent improvement for large datasets of 100 thousand samples, not so for medium and small datasets of 10 and 1 thousand samples, respectively. Besides, as expected, XGBoost recognition performance improves as more data is available, and detection performance deteriorates as the datasets become more imbalanced. Tests on distributions with 50, 45, 25, and 5 percent positive samples show that the largest drop in detection performance occurs for the distribution with only 5 percent positive samples. Sampling to balance the training set does not provide consistent improvement. Therefore, future work will include a systematic study of different techniques to deal with data imbalance and evaluating other approaches, including graphs, autoencoders, and generative adversarial methods, to deal with the lack of labels.
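
To show how different metrics behave on balanced and imbalanced datasets, the sketch below scores a detector that never raises an alarm on a toy set with 5 percent positives. It is not the paper's example: the data are made up, the function name is hypothetical, and reporting precision, recall and F1 as 0.0 when they are undefined is a convention assumed here.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Assumed convention: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy data: 5 positives and 95 negatives (5 percent positive samples).
    y_true = [1] * 5 + [0] * 95
    never_alarm = [0] * 100  # detector that labels everything negative
    print(binary_metrics(y_true, never_alarm))
```

On this toy set accuracy is high although no positive sample is detected, while recall and F1 are zero, which is why accuracy alone is a poor guide on imbalanced data.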

Submitted 27 March, 2023; originally announced March 2023.

Comments: 17 pages, 8 figures, 9 tables, Presented at NVIDIA GTC, The Conference for the Era of AI and the Metaverse, March 23, 2023. [S51129]

arXiv:2002.03058

Lessons Learned Developing and Extending a Visual Analytics Solution for Investigative Analysis of Scamming Activities

Authors: Ronak Tanna, Shivam Dhar, Ashwin Sudhir, Shreyash Devan, Shubham Verma

Abstract: Cybersecurity analysts work on large communication data sets to perform investigative analysis by painstakingly going over thousands of email conversations to find potential scamming activities and the network of cyber scammers. Traditionally, experts used email clients, database systems and text editors to perform this investigation. With the advent of technology, elaborate tools that summarize data more efficiently by using cutting edge data visualization techniques have come out. Beagle [1] is one such tool which visualizes the large communication data using different panels such that the inspector has better chances of finding the scam network. This paper is a report on our work to implement and improve the work done by Jay Koven et al. [1]. We have proposed and demonstrated via implementation, a few more visualizations that we feel would help in grouping and analyzing the e-mail data more efficiently. Lastly, we have also presented a case study that shows the potential use of our tool in a real-world scenario.

Submitted 7 February, 2020; originally announced February 2020.

Showing 1–7 of 7 results for author: Sudhir, A