DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments
Authors:
Chiyu Zhang,
Marc-Alexandre Cote,
Michael Albada,
Anush Sankaran,
Jack W. Stokes,
Tong Wang,
Amir Abdi,
William Blum,
Muhammad Abdul-Mageed
Abstract:
Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicio…
▽ More
Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.
△ Less
Submitted 10 June, 2025; v1 submitted 31 May, 2025;
originally announced June 2025.
Predicting Stock Movement with BERTweet and Transformers
Authors:
Michael Charles Albada,
Mojolaoluwa Joshua Sonola
Abstract:
Applying deep learning and computational intelligence to finance has been a popular area of applied research, both within academia and industry, and continues to attract active attention. The inherently high volatility and non-stationary of the data pose substantial challenges to machine learning models, especially so for today's expressive and highly-parameterized deep learning models. Recent wor…
▽ More
Applying deep learning and computational intelligence to finance has been a popular area of applied research, both within academia and industry, and continues to attract active attention. The inherently high volatility and non-stationary of the data pose substantial challenges to machine learning models, especially so for today's expressive and highly-parameterized deep learning models. Recent work has combined natural language processing on data from social media to augment models based purely on historic price data to improve performance has received particular attention. Previous work has achieved state-of-the-art performance on this task by combining techniques such as bidirectional GRUs, variational autoencoders, word and document embeddings, self-attention, graph attention, and adversarial training. In this paper, we demonstrated the efficacy of BERTweet, a variant of BERT pre-trained specifically on a Twitter corpus, and the transformer architecture by achieving competitive performance with the existing literature and setting a new baseline for Matthews Correlation Coefficient on the Stocknet dataset without auxiliary data sources.
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
Predicting Clinical Outcomes with Waveform LSTMs
Authors:
Michael Albada
Abstract:
Data mining and machine learning hold great potential to enable health systems to systematically use data and analytics to identify inefficiencies and best practices that improve care and reduce costs. Waveform data offers particularly detailed information on how patient health evolves over time and has the potential to significantly improve prediction accuracy on multiple benchmarks, but has been…
▽ More
Data mining and machine learning hold great potential to enable health systems to systematically use data and analytics to identify inefficiencies and best practices that improve care and reduce costs. Waveform data offers particularly detailed information on how patient health evolves over time and has the potential to significantly improve prediction accuracy on multiple benchmarks, but has been widely under-utilized, largely because of the challenges in working with these large and complex datasets. This study evaluates the potential of leveraging clinical waveform data to improve prediction accuracy on a single benchmark task: the risk of mortality in the intensive care unit. We identify significant potential from this data, beating the existing baselines for both logistic regression and deep learning models.
△ Less
Submitted 13 March, 2025;
originally announced March 2025.