-
Tree Boosting Methods for Balanced and Imbalanced Classification and their Robustness Over Time in Risk Assessment
Authors:
Gissel Velarde,
Michael Weichert,
Anuj Deshmunkh,
Sanjay Deshmane,
Anindya Sudhir,
Khushboo Sharma,
Vaibhav Joshi
Abstract:
Most real-world classification problems deal with imbalanced datasets, posing a challenge for Artificial Intelligence (AI), i.e., machine learning algorithms, because the minority class, which is of extreme interest, often proves difficult to detect. This paper empirically evaluates tree boosting methods' performance given different dataset sizes and class distributions, from perfectly balanced to highly imbalanced. For tabular data, tree-based methods such as XGBoost stand out in several benchmarks due to detection performance and speed. Therefore, XGBoost and Imbalance-XGBoost are evaluated. After introducing the motivation to address risk assessment with machine learning, the paper reviews evaluation metrics for detection systems or binary classifiers. It proposes a method for data preparation followed by tree boosting methods including hyper-parameter optimization. The method is evaluated on private datasets of 1 thousand (K), 10K and 100K samples on distributions with 50, 45, 25, and 5 percent positive samples. As expected, the developed method increases its recognition performance as more data is given for training, and the F1 score decreases as the data distribution becomes more imbalanced, but it is still significantly superior to the precision-recall baseline, which is given by the number of positives divided by the total number of positives and negatives. Sampling to balance the training set does not provide consistent improvement and deteriorates detection. In contrast, classifier hyper-parameter optimization improves recognition, but should be applied carefully depending on data volume and distribution. Finally, the developed method is robust to data variation over time up to some point. Retraining can be used when performance starts deteriorating.
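The precision-recall baseline mentioned in the abstract can be illustrated with a short sketch. It assumes a hypothetical dataset of 100K samples at each of the four class distributions named above; the function names `pr_baseline` and `f1_score` are illustrative and do not come from the paper.

```python
# Illustrative sketch only: the helper names are hypothetical, and the
# class shares (50, 45, 25, 5 percent positives) are taken from the abstract.

def pr_baseline(positives: int, negatives: int) -> float:
    """Ratio of positives to all samples: the precision obtained by
    flagging every sample as positive (recall is then 1.0)."""
    return positives / (positives + negatives)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    n_samples = 100_000  # assumed size, matching the largest dataset size mentioned
    for pct_positive in (50, 45, 25, 5):
        positives = n_samples * pct_positive // 100
        negatives = n_samples - positives
        baseline = pr_baseline(positives, negatives)
        # F1 of the trivial "flag everything" classifier at this class share
        trivial_f1 = f1_score(baseline, 1.0)
        print(f"{pct_positive:>2}% positives: baseline={baseline:.2f}, "
              f"trivial F1={trivial_f1:.3f}")
```

The baseline falls with the share of positives, so a given F1 score on the 5 percent distribution represents a larger margin over the trivial classifier than the same score on a balanced dataset.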
Submitted 25 April, 2025;
originally announced April 2025.
-
A Benchmark for Scalable Oversight Protocols
Authors:
Abhimanyu Pallavi Sudhir,
Jackson Kaunismaa,
Arjun Panickssery
Abstract:
As AI agents surpass human capabilities, scalable oversight -- the problem of effectively supplying human feedback to potentially superhuman AI models -- becomes increasingly critical to ensure alignment. While numerous scalable oversight protocols have been proposed, they lack a systematic empirical framework to evaluate and compare them. While recent works have tried to empirically study scalable oversight protocols -- particularly Debate -- we argue that the experiments they conduct are not generalizable to other protocols. We introduce the scalable oversight benchmark, a principled framework for evaluating human feedback mechanisms based on our agent score difference (ASD) metric, a measure of how effectively a mechanism advantages truth-telling over deception. We supply a Python package to facilitate rapid and competitive evaluation of scalable oversight protocols on our benchmark, and conduct a demonstrative experiment benchmarking Debate.
Submitted 31 March, 2025;
originally announced April 2025.
-
Market-based Architectures in RL and Beyond
Authors:
Abhimanyu Pallavi Sudhir,
Long Tran-Thanh
Abstract:
Market-based agents refer to reinforcement learning agents which determine their actions based on an internal market of sub-agents. We introduce a new type of market-based algorithm where the state itself is factored into several axes called "goods", which allows for greater specialization and parallelism than existing market-based RL algorithms. Furthermore, we argue that market-based algorithms have the potential to address many current challenges in AI, such as search, dynamic scaling and complete feedback, and demonstrate that they may be seen to generalize neural networks; finally, we list some novel ways that market algorithms may be applied in conjunction with Large Language Models for immediate practical applicability.
Submitted 5 March, 2025;
originally announced March 2025.
-
Consistency Checks for Language Model Forecasters
Authors:
Daniel Paleka,
Abhimanyu Pallavi Sudhir,
Alejandro Alvarez,
Vineeth Bhat,
Adam Shen,
Evan Wang,
Florian Tramèr
Abstract:
Forecasting is a task that is difficult to evaluate: the ground truth can only be known in the future. Recent work showing LLM forecasters rapidly approaching human-level performance begs the question: how can we benchmark and evaluate these forecasters instantaneously? Following the consistency check framework, we measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We propose a new, general consistency metric based on arbitrage: for example, if a forecasting AI illogically predicts that both the Democratic and Republican parties have 60% probability of winning the 2024 US presidential election, an arbitrageur can trade against the forecaster's predictions and make a profit. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits the predictions of the forecaster, and measures the consistency of the predictions. We then build a standard, proper-scoring-rule forecasting benchmark, and show that our (instantaneous) consistency metrics correlate with LLM forecasters' ground truth Brier scores (which are only known in the future). We also release a consistency benchmark that resolves in 2028, providing a long-term evaluation tool for forecasting.
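The election example in the abstract can be worked through numerically. The sketch below is a simplified illustration, not necessarily the metric defined in the paper: it assumes the outcomes are mutually exclusive and exhaustive, that each contract pays 1 if its outcome occurs, and that the arbitrageur can buy or sell one unit of each contract at the forecaster's stated probability. The function names are hypothetical.

```python
# Simplified illustration under stated assumptions; not the paper's metric.
# Assumptions: outcomes are mutually exclusive and exhaustive, each contract
# pays 1 if its outcome occurs, and trades execute at the stated probability.

def guaranteed_profit(probabilities: list[float]) -> float:
    """Risk-free profit from trading one unit of every contract.

    If the prices sum to more than 1, sell one of each: collect the sum and
    pay out exactly 1. If they sum to less than 1, buy one of each: pay the
    sum and collect exactly 1. Either way the profit is |sum - 1|.
    """
    return abs(sum(probabilities) - 1.0)


def brier_score(probability: float, outcome: int) -> float:
    """Squared error between a forecast and the resolved outcome (0 or 1)."""
    return (probability - outcome) ** 2


if __name__ == "__main__":
    # The example from the abstract: both parties forecast at 60%.
    inconsistent = [0.6, 0.6]
    print(f"arbitrage profit: {guaranteed_profit(inconsistent):.2f}")

    # A consistent pair of forecasts leaves nothing to exploit.
    consistent = [0.6, 0.4]
    print(f"arbitrage profit: {guaranteed_profit(consistent):.2f}")

    # The Brier score, by contrast, needs the resolved outcome.
    print(f"Brier score if the event occurs: {brier_score(0.6, 1):.2f}")
```

The profit calculation needs only the forecasts themselves, whereas the Brier score requires the outcome, which is why a consistency measure can be computed immediately.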
Submitted 9 January, 2025; v1 submitted 24 December, 2024;
originally announced December 2024.
-
Betting on what is neither verifiable nor falsifiable
Authors:
Abhimanyu Pallavi Sudhir,
Long Tran-Thanh
Abstract:
Prediction markets are useful for estimating probabilities of claims whose truth will be revealed at some fixed time -- this includes questions about the values of real-world events (i.e. statistical uncertainty), and questions about the values of primitive recursive functions (i.e. logical or algorithmic uncertainty). However, they cannot be directly applied to questions without a fixed resolution criterion, and real-world applications of prediction markets to such questions often amount to predicting not whether a sentence is true, but whether it will be proven. Such questions could be represented by countable unions or intersections of more basic events, or as First-Order-Logic sentences on the Arithmetical Hierarchy (or even beyond FOL, as hyperarithmetical sentences). In this paper, we propose an approach to betting on such events via options, or equivalently as bets on the outcome of a "verification-falsification game". Our work thus acts as an alternative to the existing framework of Garrabrant induction for logical uncertainty, and relates to the stance known as constructivism in the philosophy of mathematics; furthermore it has broader implications for philosophy and mathematical logic.
Submitted 29 January, 2024;
originally announced February 2024.
-
Evaluating XGBoost for Balanced and Imbalanced Data: Application to Fraud Detection
Authors:
Gissel Velarde,
Anindya Sudhir,
Sanjay Deshmane,
Anuj Deshmunkh,
Khushboo Sharma,
Vaibhav Joshi
Abstract:
This paper evaluates XGBoost's performance given different dataset sizes and class distributions, from perfectly balanced to highly imbalanced. XGBoost has been selected for evaluation, as it stands out in several benchmarks due to its detection performance and speed. After introducing the problem of fraud detection, the paper reviews evaluation metrics for detection systems or binary classifiers, and illustrates with examples how different metrics work for balanced and imbalanced datasets. Then, it examines the principles of XGBoost. It proposes a pipeline for data preparation and compares a vanilla XGBoost against a random search-tuned XGBoost. Random search fine-tuning provides consistent improvement for large datasets of 100 thousand samples, but not for medium and small datasets of 10 and 1 thousand samples, respectively. Besides, as expected, XGBoost recognition performance improves as more data is available, and detection performance deteriorates as the datasets become more imbalanced. Tests on distributions with 50, 45, 25, and 5 percent positive samples show that the largest drop in detection performance occurs for the distribution with only 5 percent positive samples. Sampling to balance the training set does not provide consistent improvement. Therefore, future work will include a systematic study of different techniques to deal with data imbalance and evaluating other approaches, including graphs, autoencoders, and generative adversarial methods, to deal with the lack of labels.
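How the same metrics behave on balanced and imbalanced data can be sketched with a confusion matrix. The example below is not taken from the paper: it assumes a hypothetical detector with a fixed 80 percent true-positive rate and 5 percent false-positive rate, applied to 100K samples at each class distribution listed above. The helper names `counts_for` and `metrics` are illustrative.

```python
# Illustrative sketch: the detector rates (80% TPR, 5% FPR) and the helper
# names are assumptions for demonstration, not values from the paper.

def counts_for(pct_positive: int, n: int = 100_000,
               tpr_pct: int = 80, fpr_pct: int = 5) -> tuple[int, int, int, int]:
    """Confusion-matrix counts (tp, fp, fn, tn) for a detector with fixed rates."""
    positives = n * pct_positive // 100
    negatives = n - positives
    tp = positives * tpr_pct // 100
    fp = negatives * fpr_pct // 100
    return tp, fp, positives - tp, negatives - fp


def metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    for pct in (50, 45, 25, 5):
        m = metrics(*counts_for(pct))
        print(f"{pct:>2}% positives: " +
              ", ".join(f"{name}={value:.3f}" for name, value in m.items()))

    # A classifier that never raises an alert looks accurate at 5% positives
    # (tp=0, fp=0, fn=5000, tn=95000) yet detects nothing.
    print(metrics(0, 0, 5_000, 95_000))
```

With the detector's rates held fixed, recall stays constant while precision and F1 fall as positives become rarer, and accuracy alone rewards a classifier that never flags anything.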
Submitted 27 March, 2023;
originally announced March 2023.
-
Lessons Learned Developing and Extending a Visual Analytics Solution for Investigative Analysis of Scamming Activities
Authors:
Ronak Tanna,
Shivam Dhar,
Ashwin Sudhir,
Shreyash Devan,
Shubham Verma
Abstract:
Cybersecurity analysts work on large communication data sets to perform investigative analysis by painstakingly going over thousands of email conversations to find potential scamming activities and the network of cyber scammers. Traditionally, experts used email clients, database systems and text editors to perform this investigation. With the advent of technology, elaborate tools that summarize data more efficiently by using cutting-edge data visualization techniques have come out. Beagle [1] is one such tool, which visualizes the large communication data using different panels so that the inspector has a better chance of finding the scam network. This paper is a report on our work to implement and improve the work done by Jay Koven et al. [1]. We have proposed and demonstrated, via implementation, a few more visualizations that we feel would help in grouping and analyzing the email data more efficiently. Lastly, we have also presented a case study that shows the potential use of our tool in a real-world scenario.
Submitted 7 February, 2020;
originally announced February 2020.