-
SafeArena: Evaluating the Safety of Autonomous Web Agents
Authors:
Ada Defne Tur,
Nicholas Meade,
Xing Han Lù,
Alejandra Zambrano,
Arkil Patel,
Esin Durmus,
Spandana Gella,
Karolina Stańczak,
Siva Reddy
Abstract:
LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories -- misinformation, illegal activity, harassment, cybercrime, and social bias, designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework that categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available here: https://safearena.github.io
Submitted 6 March, 2025;
originally announced March 2025.
-
Language Models Largely Exhibit Human-like Constituent Ordering Preferences
Authors:
Ada Defne Tur,
Gaurav Kamath,
Siva Reddy
Abstract:
Though English sentences are typically inflexible vis-à-vis word order, constituents often show far more variability in ordering. One prominent theory presents the notion that constituent ordering is directly correlated with constituent weight: a measure of the constituent's length or complexity. Such theories are interesting in the context of natural language processing (NLP), because while recent advances in NLP have led to significant gains in the performance of large language models (LLMs), much remains unclear about how these models process language, and how this compares to human language processing. In particular, the question remains whether LLMs display the same patterns with constituent movement, and may provide insights into existing theories on when and how the shift occurs in human language. We compare a variety of LLMs with diverse properties to evaluate broad LLM performance on four types of constituent movement: heavy NP shift, particle movement, dative alternation, and multiple PPs. Despite performing unexpectedly around particle movement, LLMs generally align with human preferences around constituent ordering.
Submitted 14 February, 2025; v1 submitted 8 February, 2025;
originally announced February 2025.
-
ProGRes: Prompted Generative Rescoring on ASR n-Best
Authors:
Ada Defne Tur,
Adel Moumen,
Mirco Ravanelli
Abstract:
Large Language Models (LLMs) have shown their ability to improve the performance of speech recognizers by effectively rescoring the n-best hypotheses generated during the beam search process. However, the best way to exploit recent generative instruction-tuned LLMs for hypothesis rescoring is still unclear. This paper proposes a novel method that uses instruction-tuned LLMs to dynamically expand the n-best speech recognition hypotheses with new hypotheses generated through appropriately-prompted LLMs. Specifically, we introduce a new zero-shot method for ASR n-best rescoring, which combines confidence scores, LLM sequence scoring, and prompt-based hypothesis generation. We compare Llama-3-Instruct, GPT-3.5 Turbo, and GPT-4 Turbo as prompt-based generators with Llama-3 as sequence scorer LLM. We evaluated our approach using different speech recognizers and observed significant relative improvement in the word error rate (WER) ranging from 5% to 25%.
Submitted 8 September, 2024; v1 submitted 30 August, 2024;
originally announced September 2024.
-
Alexa Arena: A User-Centric Interactive Platform for Embodied AI
Authors:
Qiaozi Gao,
Govind Thattai,
Suhaila Shakiah,
Xiaofeng Gao,
Shreyas Pansare,
Vasu Sharma,
Gaurav Sukhatme,
Hangjie Shi,
Bofei Yang,
Desheng Zheng,
Lucy Hu,
Karthika Arumugam,
Shui Hu,
Matthew Wen,
Dinakar Guthy,
Cadence Chung,
Rohan Khanna,
Osman Ipek,
Leslie Ball,
Kate Bland,
Heather Rocker,
Yadunandana Rao,
Michael Johnston,
Reza Ghanadan,
Arindam Mandal
, et al. (2 additional authors not shown)
Abstract:
We introduce Alexa Arena, a user-centric simulation platform for Embodied AI (EAI) research. Alexa Arena provides a variety of multi-room layouts and interactable objects, for the creation of human-robot interaction (HRI) missions. With user-friendly graphics and control mechanisms, Alexa Arena supports the development of gamified robotic tasks readily accessible to general human users, thus opening a new venue for high-efficiency HRI data collection and EAI system evaluation. Along with the platform, we introduce a dialog-enabled instruction-following benchmark and provide baseline results for it. We make Alexa Arena publicly available to facilitate research in building generalizable and assistive embodied agents.
Submitted 7 June, 2023; v1 submitted 2 March, 2023;
originally announced March 2023.
-
Correcting Automated and Manual Speech Transcription Errors using Warped Language Models
Authors:
Mahdi Namazifar,
John Malik,
Li Erran Li,
Gokhan Tur,
Dilek Hakkani Tür
Abstract:
Masked language models have revolutionized natural language processing systems in the past few years. A recently introduced generalization of masked language models, called warped language models, is trained to be more robust to the types of errors that appear in automatic or manual transcriptions of spoken language by exposing the language model to the same types of errors during training. In this work we propose a novel approach that takes advantage of the robustness of warped language models to transcription noise for correcting transcriptions of spoken language. We show that our proposed approach is able to achieve up to 10% reduction in word error rates of both automatic and manual transcriptions of spoken language.
Submitted 26 March, 2021;
originally announced March 2021.
-
Warped Language Models for Noise Robust Language Understanding
Authors:
Mahdi Namazifar,
Gokhan Tur,
Dilek Hakkani Tür
Abstract:
Masked Language Models (MLM) are self-supervised neural networks trained to fill in the blanks in a given sentence with masked tokens. Despite the tremendous success of MLMs for various text based tasks, they are not robust for spoken language understanding, especially for spontaneous conversational speech recognition noise. In this work we introduce Warped Language Models (WLM) in which input sentences at training time go through the same modifications as in MLM, plus two additional modifications, namely inserting and dropping random tokens. These two modifications extend and contract the sentence in addition to the modifications in MLMs, hence the word "warped" in the name. The insertion and drop modification of the input text during training of WLM resemble the types of noise due to Automatic Speech Recognition (ASR) errors, and as a result WLMs are likely to be more robust to ASR noise. Through computational results we show that natural language understanding systems built on top of WLMs perform better compared to those built based on MLMs, especially in the presence of ASR errors.
Submitted 3 November, 2020;
originally announced November 2020.
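Illustrative sketch (not from the paper): the warping described in the abstract above, where training inputs receive the usual MLM modifications plus random token insertion and dropping, could look roughly like the following. The function name `warp_tokens`, the warp probability `p_warp`, the uniform choice among operations, and the `[MASK]` string are all assumptions made for illustration; the abstract does not specify them.

```python
import random

MASK = "[MASK]"  # assumed mask symbol; the abstract does not name one


def warp_tokens(tokens, vocab, rng, p_warp=0.15):
    """Return a warped copy of `tokens`.

    Each token is selected for warping with probability `p_warp`
    (hypothetical value). A selected token receives one of five
    operations chosen uniformly at random (also an assumption):
    mask, replace, keep (MLM-style), or insert, drop (the two extra
    operations that stretch or shrink the sentence).
    """
    ops = ("mask", "replace", "keep", "insert", "drop")
    warped = []
    for tok in tokens:
        if rng.random() >= p_warp:
            warped.append(tok)  # token left untouched
            continue
        op = rng.choice(ops)
        if op == "mask":
            warped.append(MASK)
        elif op == "replace":
            warped.append(rng.choice(vocab))
        elif op == "keep":
            warped.append(tok)
        elif op == "insert":
            # add a random token before the original one
            warped.append(rng.choice(vocab))
            warped.append(tok)
        # "drop": append nothing, so the sentence contracts
    return warped


if __name__ == "__main__":
    sentence = "play some jazz music in the kitchen".split()
    vocabulary = ["the", "a", "play", "stop", "music", "light"]
    print(warp_tokens(sentence, vocabulary, random.Random(0), p_warp=0.5))
```

Because insertions and drops change the sequence length, the output can be shorter or longer than the input, which is the property the abstract likens to ASR insertion and deletion errors.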
-
Mobile Phone Usage Data for Credit Scoring
Authors:
Henri Ots,
Innar Liiv,
Diana Tur
Abstract:
The aim of this study is to demonstrate that mobile phone usage data can be used to make predictions and find the best classification method for credit scoring even if the dataset is small (2,503 customers). We use different classification algorithms to split customers into paying and non-paying ones using mobile data, and then compare the predicted results with actual results. There are several related works publicly accessible in which mobile data has been used for credit scoring, but they are all based on a large dataset. Small companies are unable to use datasets as large as those used by these related papers; therefore, these studies are of little use to them. In this paper we argue that there is value in mobile phone usage data for credit scoring even if the dataset is small. We found that with a dataset that consists of mobile data based only on 2,503 customers, we can predict credit risk. The best classification method achieved an AUC (area under the curve) of 0.62.
Submitted 28 February, 2020;
originally announced February 2020.