-
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Authors:
Boyu Gou,
Zanming Huang,
Yuting Ning,
Yu Gu,
Michael Lin,
Weijian Qi,
Andrei Kopanev,
Botao Yu,
Bernal Jiménez Gutiérrez,
Yiheng Shu,
Chan Hee Song,
Jiaman Wu,
Shijie Chen,
Hanane Nour Moussa,
Tianshu Zhang,
Jian Xie,
Yifei Li,
Tianci Xue,
Zeyi Liao,
Kai Zhang,
Boyuan Zheng,
Zhaowei Cai,
Viktor Rozgic,
Morteza Ziyadi,
Huan Sun
, et al. (1 additional authors not shown)
Abstract:
Agentic search such as Deep Research systems, where large language models autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing eval…
▽ More
Agentic search such as Deep Research systems, where large language models autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of nine frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, showing a great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
The Amazon Nova Family of Models: Technical Report and Model Card
Authors:
Amazon AGI,
Aaron Langford,
Aayush Shah,
Abhanshu Gupta,
Abhimanyu Bhatter,
Abhinav Goyal,
Abhinav Mathur,
Abhinav Mohanty,
Abhishek Kumar,
Abhishek Sethi,
Abi Komma,
Abner Pena,
Achin Jain,
Adam Kunysz,
Adam Opyrchal,
Adarsh Singh,
Aditya Rawal,
Adok Achar Budihal Prasad,
Adrià de Gispert,
Agnika Kumar,
Aishwarya Aryamane,
Ajay Nair,
Akilan M,
Akshaya Iyengar,
Akshaya Vishnu Kudlu Shanbhogue
, et al. (761 additional authors not shown)
Abstract:
We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents…
▽ More
We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.
△ Less
Submitted 17 March, 2025;
originally announced June 2025.
-
RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems
Authors:
Yixiao Zeng,
Tianyu Cao,
Danqing Wang,
Xinran Zhao,
Zimeng Qiu,
Morteza Ziyadi,
Tongshuang Wu,
Lei Li
Abstract:
Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicting between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and…
▽ More
Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicting between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 400 expert-level time-sensitive finance, economics, and policy documents and 48,322 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our results show that RAG systems exhibit surprising vulnerability to perturbations, with document robustness consistently being the weakest point regardless of generator size or architecture. RAG systems consistently show lower robustness on multi-hop queries than single-hop queries across all domains.
△ Less
Submitted 31 May, 2025;
originally announced June 2025.
-
RAGPPI: RAG Benchmark for Protein-Protein Interactions in Drug Discovery
Authors:
Youngseung Jeon,
Ziwen Li,
Thomas Li,
JiaSyuan Chang,
Morteza Ziyadi,
Xiang 'Anthony' Chen
Abstract:
Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identify…
▽ More
Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as a type of QA and source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that reflected expert labeling characteristics, which facilitates the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX
Authors:
Liuyue Xie,
George Z. Wei,
Avik Kuthiala,
Ce Zheng,
Ananya Bal,
Mosam Dabhi,
Liting Wen,
Taru Rustagi,
Ethan Lai,
Sushil Khyalia,
Rohan Choudhury,
Morteza Ziyadi,
Xu Zhang,
Hao Yang,
László A. Jeni
Abstract:
Frontier models have either been language-only or have primarily focused on vision and language modalities. Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modality perception performance. We introduce MAVERIX~(Multimodal Audio-Visual Eva…
▽ More
Frontier models have either been language-only or have primarily focused on vision and language modalities. Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modality perception performance. We introduce MAVERIX~(Multimodal Audio-Visual Evaluation Reasoning IndeX), a novel benchmark with 700 videos and 2,556 questions explicitly designed to evaluate multimodal models through tasks that necessitate close integration of video and audio information. MAVERIX uniquely provides models with audiovisual tasks, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration. Experiments with state-of-the-art models, including Gemini 1.5 Pro and o1, show performance approaching human levels (around 70% accuracy), while human experts reach near-ceiling performance (95.1%). With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
Certifying Counterfactual Bias in LLMs
Authors:
Isha Chaudhary,
Qian Hu,
Manoj Kumar,
Morteza Ziyadi,
Rahul Gupta,
Gagandeep Singh
Abstract:
Large Language Models (LLMs) can produce biased responses that can cause representational harms. However, conventional studies are insufficient to thoroughly evaluate biases across LLM responses for different demographic groups (a.k.a. counterfactual bias), as they do not scale to large number of inputs and do not provide guarantees. Therefore, we propose the first framework, LLMCert-B that certif…
▽ More
Large Language Models (LLMs) can produce biased responses that can cause representational harms. However, conventional studies are insufficient to thoroughly evaluate biases across LLM responses for different demographic groups (a.k.a. counterfactual bias), as they do not scale to large number of inputs and do not provide guarantees. Therefore, we propose the first framework, LLMCert-B that certifies LLMs for counterfactual bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of counterfactual prompts - prompts differing by demographic groups, sampled from a distribution. We illustrate counterfactual bias certification for distributions of counterfactual prompts created by applying prefixes sampled from prefix distributions, to a given set of prompts. We consider prefix distributions consisting random token sequences, mixtures of manual jailbreaks, and perturbations of jailbreaks in LLM's embedding space. We generate non-trivial certificates for SOTA LLMs, exposing their vulnerabilities over distributions of prompts generated from computationally inexpensive prefix distributions.
△ Less
Submitted 21 April, 2025; v1 submitted 29 May, 2024;
originally announced May 2024.
-
Partial Federated Learning
Authors:
Tiantian Feng,
Anil Ramakrishna,
Jimit Majmudar,
Charith Peris,
Jixuan Wang,
Clement Chung,
Richard Zemel,
Morteza Ziyadi,
Rahul Gupta
Abstract:
Federated Learning (FL) is a popular algorithm to train machine learning models on user data constrained to edge devices (for example, mobile phones) due to privacy concerns. Typically, FL is trained with the assumption that no part of the user data can be egressed from the edge. However, in many production settings, specific data-modalities/meta-data are limited to be on device while others are n…
▽ More
Federated Learning (FL) is a popular algorithm to train machine learning models on user data constrained to edge devices (for example, mobile phones) due to privacy concerns. Typically, FL is trained with the assumption that no part of the user data can be egressed from the edge. However, in many production settings, specific data-modalities/meta-data are limited to be on device while others are not. For example, in commercial SLU systems, it is typically desired to prevent transmission of biometric signals (such as audio recordings of the input prompt) to the cloud, but egress of locally (i.e. on the edge device) transcribed text to the cloud may be possible. In this work, we propose a new algorithm called Partial Federated Learning (PartialFL), where a machine learning model is trained using data where a subset of data modalities or their intermediate representations can be made available to the server. We further restrict our model training by preventing the egress of data labels to the cloud for better privacy, and instead use a contrastive learning based model objective. We evaluate our approach on two different multi-modal datasets and show promising results with our proposed approach.
△ Less
Submitted 3 March, 2024;
originally announced March 2024.
-
Reasoning Like Program Executors
Authors:
Xinyu Pi,
Qian Liu,
Bei Chen,
Morteza Ziyadi,
Zeqi Lin,
Qiang Fu,
Yan Gao,
Jian-Guang Lou,
Weizhu Chen
Abstract:
Reasoning over natural language is a long-standing goal for the research community. However, studies have shown that existing language models are inadequate in reasoning. To address the issue, we present POET, a novel reasoning pre-training paradigm. Through pre-training language models with programs and their execution results, POET empowers language models to harvest the reasoning knowledge poss…
▽ More
Reasoning over natural language is a long-standing goal for the research community. However, studies have shown that existing language models are inadequate in reasoning. To address the issue, we present POET, a novel reasoning pre-training paradigm. Through pre-training language models with programs and their execution results, POET empowers language models to harvest the reasoning knowledge possessed by program executors via a data-driven approach. POET is conceptually simple and can be instantiated by different kinds of program executors. In this paper, we showcase two simple instances POET-Math and POET-Logic, in addition to a complex instance, POET-SQL. Experimental results on six benchmarks demonstrate that POET can significantly boost model performance in natural language reasoning, such as numerical reasoning, logical reasoning, and multi-hop reasoning. POET opens a new gate on reasoning-enhancement pre-training, and we hope our analysis would shed light on the future research of reasoning like program executors.
△ Less
Submitted 22 October, 2022; v1 submitted 27 January, 2022;
originally announced January 2022.
-
TAPEX: Table Pre-training via Learning a Neural SQL Executor
Authors:
Qian Liu,
Bei Chen,
Jiaqi Guo,
Morteza Ziyadi,
Zeqi Lin,
Weizhu Chen,
Jian-Guang Lou
Abstract:
Recent progress in language model pre-training has achieved a great success via leveraging large-scale unstructured textual data. However, it is still a challenge to apply pre-training on structured tabular data due to the absence of large-scale high-quality tabular data. In this paper, we propose TAPEX to show that table pre-training can be achieved by learning a neural SQL executor over a synthe…
▽ More
Recent progress in language model pre-training has achieved a great success via leveraging large-scale unstructured textual data. However, it is still a challenge to apply pre-training on structured tabular data due to the absence of large-scale high-quality tabular data. In this paper, we propose TAPEX to show that table pre-training can be achieved by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesizing executable SQL queries and their execution outputs. TAPEX addresses the data scarcity challenge via guiding the language model to mimic a SQL executor on the diverse, large-scale and high-quality synthetic corpus. We evaluate TAPEX on four benchmark datasets. Experimental results demonstrate that TAPEX outperforms previous table pre-training approaches by a large margin and achieves new state-of-the-art results on all of them. This includes the improvements on the weakly-supervised WikiSQL denotation accuracy to 89.5% (+2.3%), the WikiTableQuestions denotation accuracy to 57.5% (+4.8%), the SQA denotation accuracy to 74.5% (+3.5%), and the TabFact accuracy to 84.2% (+3.2%). To our knowledge, this is the first work to exploit table pre-training via synthetic executable programs and to achieve new state-of-the-art results on various downstream tasks. Our code can be found at https://github.com/microsoft/Table-Pretraining.
△ Less
Submitted 14 March, 2022; v1 submitted 15 July, 2021;
originally announced July 2021.
-
Example-Based Named Entity Recognition
Authors:
Morteza Ziyadi,
Yuting Sun,
Abhishek Goswami,
Jade Huang,
Weizhu Chen
Abstract:
We present a novel approach to named entity recognition (NER) in the presence of scarce data that we call example-based NER. Our train-free few-shot learning approach takes inspiration from question-answering to identify entity spans in a new and unseen domain. In comparison with the current state-of-the-art, the proposed method performs significantly better, especially when using a low number of…
▽ More
We present a novel approach to named entity recognition (NER) in the presence of scarce data that we call example-based NER. Our train-free few-shot learning approach takes inspiration from question-answering to identify entity spans in a new and unseen domain. In comparison with the current state-of-the-art, the proposed method performs significantly better, especially when using a low number of support examples.
△ Less
Submitted 24 August, 2020;
originally announced August 2020.
-
MT-BioNER: Multi-task Learning for Biomedical Named Entity Recognition using Deep Bidirectional Transformers
Authors:
Muhammad Raza Khan,
Morteza Ziyadi,
Mohamed AbdelHady
Abstract:
Conversational agents such as Cortana, Alexa and Siri are continuously working on increasing their capabilities by adding new domains. The support of a new domain includes the design and development of a number of NLU components for domain classification, intents classification and slots tagging (including named entity recognition). Each component only performs well when trained on a large amount…
▽ More
Conversational agents such as Cortana, Alexa and Siri are continuously working on increasing their capabilities by adding new domains. The support of a new domain includes the design and development of a number of NLU components for domain classification, intents classification and slots tagging (including named entity recognition). Each component only performs well when trained on a large amount of labeled data. Second, these components are deployed on limited-memory devices which requires some model compression. Third, for some domains such as the health domain, it is hard to find a single training data set that covers all the required slot types. To overcome these mentioned problems, we present a multi-task transformer-based neural architecture for slot tagging. We consider the training of a slot tagger using multiple data sets covering different slot types as a multi-task learning problem. The experimental results on the biomedical domain have shown that the proposed approach outperforms the previous state-of-the-art systems for slot tagging on the different benchmark biomedical datasets in terms of (time and memory) efficiency and effectiveness. The output slot tagger can be used by the conversational agent to better identify entities in the input utterances.
△ Less
Submitted 24 January, 2020;
originally announced January 2020.
-
Spatial Phase and Amplitude Structuring of Beams Using a Combination of Multiple Orthogonal Spatial Functions with Complex Coefficients
Authors:
Guodong Xie,
Cong Liu,
Long Li,
Yongxiong Ren,
Zhe Zhao,
Yan Yan,
Nisar Ahmed,
Zhe Wang,
Asher J. Willner,
Changjing Bao,
Yinwen Cao,
Morteza Ziyadi,
Ahmed Almaiman,
Solyman Ashrafi,
Moshe Tur,
Alan E. Willner
Abstract:
Analogous to time signals that can be composed of multiple frequency functions, we use uniquely structured orthogonal spatial modes to create different beam shapes. We tailor the spatial structure by judiciously choosing a weighted combination of multiple modal states within an orthogonal basis set, and we can tunably create beam phase and intensity "shapes" that are not otherwise readily achievab…
▽ More
Analogous to time signals that can be composed of multiple frequency functions, we use uniquely structured orthogonal spatial modes to create different beam shapes. We tailor the spatial structure by judiciously choosing a weighted combination of multiple modal states within an orthogonal basis set, and we can tunably create beam phase and intensity "shapes" that are not otherwise readily achievable. As an example shape, we use a series of orbital-angular-momentum (OAM) functions with adjustable complex weights to create a reconfigurable spatial region of higher localized power as compared to traditional beam combining. We simulate a structured beam created by coherently combining several orthogonal OAM beams with different complex weights, and we achieve a >10X localized power density enhancement with 19 beams. Additionally, we can create unique shapes by passing a single beam through a specially designed phase and intensity mask that contains the combination of multiple OAM functions each with complex weights. Using this approach, we experimentally demonstrate a ~2.5X localized power density increase when utilizing 9 functions.
△ Less
Submitted 28 May, 2016;
originally announced May 2016.
-
Heteroscedastic Relevance Vector Machine
Authors:
Daniel Khashabi,
Mojtaba Ziyadi,
Feng Liang
Abstract:
In this work we propose a heteroscedastic generalization to RVM, a fast Bayesian framework for regression, based on some recent similar works. We use variational approximation and expectation propagation to tackle the problem. The work is still under progress and we are examining the results and comparing with the previous works.
In this work we propose a heteroscedastic generalization to RVM, a fast Bayesian framework for regression, based on some recent similar works. We use variational approximation and expectation propagation to tackle the problem. The work is still under progress and we are examining the results and comparing with the previous works.
△ Less
Submitted 9 January, 2013;
originally announced January 2013.