-
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Authors:
Md Imbesat Hassan Rizvi,
Xiaodan Zhu,
Iryna Gurevych
Abstract:
Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.
Submitted 18 June, 2025;
originally announced June 2025.
-
A Unifying Bias-aware Multidisciplinary Framework for Investigating Socio-Technical Issues
Authors:
Sacha Hasan,
Mehdi Rizvi,
Yingfang Yuan,
Kefan Chen,
Lynne Baillie,
Wei Pang
Abstract:
This paper aims to bring together the disciplines of social science (SS) and computer science (CS) in the design and implementation of a novel multidisciplinary framework for systematic, transparent, ethically-informed, and bias-aware investigation of socio-technical issues. For this, various analysis approaches from social science and machine learning (ML) were applied in a structured sequence to arrive at an original methodology of identifying and quantifying objects of inquiry. A core feature of this framework is that it highlights where bias occurs and suggests possible steps to mitigate it. This is to improve the robustness, reliability, and explainability of the framework and its results. Such an approach also ensures that the investigation of socio-technical issues is transparent about its own limitations and potential sources of bias. To test our framework, we utilised it in the multidisciplinary investigation of the online harms encountered by minoritised ethnic (ME) communities when accessing and using digitalised social housing services in the UK. We draw our findings from 100 interviews with ME individuals in four cities across the UK to understand ME vulnerabilities when accessing and using digitalised social housing services. In our framework, a sub-sample of interviews focusing on ME individuals residing in social housing units were inductively coded. This resulted in the identification of the topics of discrimination, digital poverty, lack of digital literacy, and lack of English proficiency as key vulnerabilities of ME communities. Further ML techniques such as Topic Modelling and Sentiment Analysis were used within our framework where we found that Black African communities are more likely to experience these vulnerabilities in the access, use and outcome of digitalised social housing services.
Submitted 6 May, 2025;
originally announced May 2025.
-
An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering
Authors:
Alexander Murphy,
Mohd Sanad Zaki Rizvi,
Aden Haussmann,
Ping Nie,
Guifu Liu,
Aryo Pradipta Gema,
Pasquale Minervini
Abstract:
Large Language Models (LLMs) frequently produce factually inaccurate outputs - a phenomenon known as hallucination - which limits their accuracy in knowledge-intensive NLP tasks. Retrieval-augmented generation and agentic frameworks such as Reasoning and Acting (ReAct) can address this issue by giving the model access to external knowledge. However, LLMs often fail to remain faithful to retrieved information. Mitigating this is critical, especially if LLMs are required to reason about the retrieved information. Recent research has explored training-free decoding strategies to improve the faithfulness of model generations. We present a systematic analysis of how the combination of the ReAct framework and decoding strategies (i.e., DeCoRe, DoLa, and CAD) can influence the faithfulness of LLM-generated answers. Our results show that combining an agentic framework for knowledge retrieval with decoding methods that enhance faithfulness can increase accuracy on the downstream Multi-Hop Question Answering tasks. For example, we observe an F1 increase from 19.5 to 32.6 on HotpotQA when using ReAct and DoLa.
Submitted 30 March, 2025;
originally announced March 2025.
-
Text Classification using Graph Convolutional Networks: A Comprehensive Survey
Authors:
Syed Mustafa Haider Rizvi,
Ramsha Imran,
Arif Mahmood
Abstract:
Text classification is a quintessential and practical problem in natural language processing with applications in diverse domains such as sentiment analysis, fake news detection, medical diagnosis, and document classification. A sizable body of recent works exists where researchers have studied and tackled text classification from different angles with varying degrees of success. Graph convolution network (GCN)-based approaches have gained a lot of traction in this domain over the last decade with many implementations achieving state-of-the-art performance in more recent literature and thus, warranting the need for an updated survey. This work aims to summarize and categorize various GCN-based Text Classification approaches with regard to the architecture and mode of supervision. It identifies their strengths and limitations and compares their performance on various benchmark datasets. We also discuss future research directions and the challenges that exist in this domain.
Submitted 12 October, 2024;
originally announced October 2024.
-
A Literature Review of Keyword Spotting Technologies for Urdu
Authors:
Syed Muhammad Aqdas Rizvi
Abstract:
This literature review surveys the advancements of keyword spotting (KWS) technologies, specifically focusing on Urdu, Pakistan's low-resource language (LRL), which has complex phonetics. Despite the global strides in speech technology, Urdu presents unique challenges requiring more tailored solutions. The review traces the evolution from foundational Gaussian Mixture Models to sophisticated neural architectures like deep neural networks and transformers, highlighting significant milestones such as integrating multi-task learning and self-supervised approaches that leverage unlabeled data. It examines emerging technologies' role in enhancing KWS systems' performance within multilingual and resource-constrained settings, emphasizing the need for innovations that cater to languages like Urdu. Thus, this review underscores the need for context-specific research addressing the inherent complexities of Urdu and similar LRLs and the means of regions communicating through such languages for a more inclusive approach to speech technology.
Submitted 16 September, 2024;
originally announced September 2024.
-
Quantifying the Cross-sectoral Intersecting Discrepancies within Multiple Groups Using Latent Class Analysis Towards Fairness
Authors:
Yingfang Yuan,
Kefan Chen,
Mehdi Rizvi,
Lynne Baillie,
Wei Pang
Abstract:
The growing interest in fair AI development is evident. The "Leave No One Behind" initiative urges us to address multiple and intersecting forms of inequality in accessing services, resources, and opportunities, emphasising the significance of fairness in AI. This is particularly relevant as an increasing number of AI tools are applied to decision-making processes, such as resource allocation and service scheme development, across various sectors such as health, energy, and housing. Therefore, exploring joint inequalities in these sectors is significant and valuable for thoroughly understanding overall inequality and unfairness. This research introduces an innovative approach to quantify cross-sectoral intersecting discrepancies among user-defined groups using latent class analysis. These discrepancies can be used to approximate inequality and provide valuable insights to fairness issues. We validate our approach using both proprietary and public datasets, including both EVENS and Census 2021 (England & Wales) datasets, to examine cross-sectoral intersecting discrepancies among different ethnic groups. We also verify the reliability of the quantified discrepancy by conducting a correlation analysis with a government public metric. Our findings reveal significant discrepancies both among minority ethnic groups and between minority ethnic groups and non-minority ethnic groups, emphasising the need for targeted interventions in policy-making processes. Furthermore, we demonstrate how the proposed approach can provide valuable insights into ensuring fairness in machine learning systems.
Submitted 3 July, 2025; v1 submitted 24 May, 2024;
originally announced July 2024.
-
SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models
Authors:
Md Imbesat Hassan Rizvi,
Xiaodan Zhu,
Iryna Gurevych
Abstract:
Spatial reasoning is a crucial component of both biological and artificial intelligence. In this work, we present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning. To support our study, we created and contribute a novel Spatial Reasoning Characterization (SpaRC) framework and Spatial Reasoning Paths (SpaRP) datasets, to enable an in-depth understanding of the spatial relations and compositions as well as the usefulness of spatial reasoning chains. We found that all the state-of-the-art LLMs do not perform well on the datasets -- their performances are consistently low across different setups. The spatial reasoning capability improves substantially as model sizes scale up. Finetuning both large language models (e.g., Llama-2-70B) and smaller ones (e.g., Llama-2-13B) can significantly improve their F1-scores by 7--32 absolute points. We also found that the top proprietary LLMs still significantly outperform their open-source counterparts in topological spatial understanding and reasoning.
Submitted 6 June, 2024;
originally announced June 2024.
-
Simulating Weighted Automata over Sequences and Trees with Transformers
Authors:
Michael Rizvi,
Maude Lizaire,
Clara Lacroce,
Guillaume Rabusseau
Abstract:
Transformers are ubiquitous models in the natural language processing (NLP) community and have shown impressive empirical successes in the past few years. However, little is understood about how they reason and the limits of their computational capabilities. These models do not process data sequentially, and yet outperform sequential neural models such as RNNs. Recent work has shown that these models can compactly simulate the sequential reasoning abilities of deterministic finite automata (DFAs). This leads to the following question: can transformers simulate the reasoning of more complex finite state machines? In this work, we show that transformers can simulate weighted finite automata (WFAs), a class of models which subsumes DFAs, as well as weighted tree automata (WTA), a generalization of weighted automata to tree structured inputs. We prove these claims formally and provide upper bounds on the sizes of the transformer models needed as a function of the number of states of the target automata. Empirically, we perform synthetic experiments showing that transformers are able to learn these compact solutions via standard gradient-based training.
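For readers unfamiliar with weighted finite automata, the following is a minimal sketch of how a WFA assigns a weight to a string, using the standard definition (initial vector, one transition matrix per symbol, final vector). It is not taken from the paper; the function names and the toy two-state automaton are illustrative assumptions.

```python
def vec_mat(vector, matrix):
    """Row vector times matrix, using plain Python lists."""
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(len(matrix[0]))
    ]


def wfa_weight(alpha, transitions, beta, word):
    """Weight of `word`: alpha^T * A_{x1} * ... * A_{xn} * beta."""
    state = list(alpha)
    for symbol in word:
        # One sequential update per input symbol.
        state = vec_mat(state, transitions[symbol])
    return sum(s * b for s, b in zip(state, beta))


# Hypothetical toy WFA (an assumption for illustration): counts occurrences of "a".
alpha = [1, 0]
beta = [0, 1]
transitions = {
    "a": [[1, 1], [0, 1]],  # adds 1 to the running count
    "b": [[1, 0], [0, 1]],  # identity: leaves the count unchanged
}

count = wfa_weight(alpha, transitions, beta, "abaa")
```

The loop makes the sequential nature of the computation explicit: the state vector is updated once per symbol, which is the behaviour the abstract says transformers can simulate.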
Submitted 12 March, 2024;
originally announced March 2024.
-
Defending Root DNS Servers Against DDoS Using Layered Defenses
Authors:
A S M Rizvi,
Jelena Mirkovic,
John Heidemann,
Wesley Hardaker,
Robert Story
Abstract:
Distributed Denial-of-Service (DDoS) attacks exhaust resources, leaving a server unavailable to legitimate clients. The Domain Name System (DNS) is a frequent target of DDoS attacks. Since DNS is a critical infrastructure service, protecting it from DoS is imperative. Many prior approaches have focused on specific filters or anti-spoofing techniques to protect generic services. DNS root nameservers are more challenging to protect, since they use fixed IP addresses, serve very diverse clients and requests, receive predominantly UDP traffic that can be spoofed, and must guarantee high quality of service. In this paper we propose a layered DDoS defense for DNS root nameservers. Our defense uses a library of defensive filters, which can be optimized for different attack types, with different levels of selectivity. We further propose a method that automatically and continuously evaluates and selects the best combination of filters throughout the attack. We show that this layered defense approach provides exceptional protection against all attack types using traces of ten real attacks from a DNS root nameserver. Our automated system can select the best defense within seconds and quickly reduces traffic to the server within a manageable range, while keeping collateral damage lower than 2%. We can handle millions of filtering rules without noticeable operational overhead.
Submitted 15 September, 2022;
originally announced September 2022.
-
An AI tool for automated analysis of large-scale unstructured clinical cine CMR databases
Authors:
Jorge Mariscal-Harana,
Clint Asher,
Vittoria Vergani,
Maleeha Rizvi,
Louise Keehn,
Raymond J. Kim,
Robert M. Judd,
Steffen E. Petersen,
Reza Razavi,
Andrew King,
Bram Ruijsink,
Esther Puyol-Antón
Abstract:
Artificial intelligence (AI) techniques have been proposed for automating analysis of short axis (SAX) cine cardiac magnetic resonance (CMR), but no CMR analysis tool exists to automatically analyse large (unstructured) clinical CMR datasets. We develop and validate a robust AI tool for start-to-end automatic quantification of cardiac function from SAX cine CMR in large clinical databases. Our pipeline for processing and analysing CMR databases includes automated steps to identify the correct data, robust image pre-processing, an AI algorithm for biventricular segmentation of SAX CMR and estimation of functional biomarkers, and automated post-analysis quality control to detect and correct errors. The segmentation algorithm was trained on 2793 CMR scans from two NHS hospitals and validated on additional cases from this dataset (n=414) and five external datasets (n=6888), including scans of patients with a range of diseases acquired at 12 different centres using CMR scanners from all major vendors. Median absolute errors in cardiac biomarkers were within the range of inter-observer variability: <8.4mL (left ventricle volume), <9.2mL (right ventricle volume), <13.3g (left ventricular mass), and <5.9% (ejection fraction) across all datasets. Stratification of cases according to phenotypes of cardiac disease and scanner vendors showed good performance across all groups. We show that our proposed tool, which combines image pre-processing steps, a domain-generalisable AI algorithm trained on a large-scale multi-domain CMR dataset and quality control steps, allows robust analysis of (clinical or research) databases from multiple centres, vendors, and cardiac diseases. This enables translation of our tool for use in fully-automated processing of large multi-centre databases.
Submitted 5 July, 2023; v1 submitted 15 June, 2022;
originally announced June 2022.
-
Identifying causal relations in tweets using deep learning: Use case on diabetes-related tweets from 2017-2021
Authors:
Adrian Ahne,
Vivek Khetan,
Xavier Tannier,
Md Imbessat Hassan Rizvi,
Thomas Czernichow,
Francisco Orchard,
Charline Bour,
Andrew Fano,
Guy Fagherazzi
Abstract:
Objective: Leveraging machine learning methods, we aim to extract both explicit and implicit cause-effect associations in patient-reported, diabetes-related tweets and provide a tool to better understand opinion, feelings and observations shared within the diabetes online community from a causality perspective. Materials and Methods: More than 30 million diabetes-related tweets in English were collected between April 2017 and January 2021. Deep learning and natural language processing methods were applied to focus on tweets with personal and emotional content. A cause-effect-tweet dataset was manually labeled and used to train 1) a fine-tuned Bertweet model to detect causal sentences containing a causal association 2) a CRF model with BERT based features to extract possible cause-effect associations. Causes and effects were clustered in a semi-supervised approach and visualised in an interactive cause-effect-network. Results: Causal sentences were detected with a recall of 68% in an imbalanced dataset. A CRF model with BERT based features outperformed a fine-tuned BERT model for cause-effect detection with a macro recall of 68%. This led to 96,676 sentences with cause-effect associations. "Diabetes" was identified as the central cluster followed by "Death" and "Insulin". Insulin pricing related causes were frequently associated with "Death". Conclusions: A novel methodology was developed to detect causal sentences and identify both explicit and implicit, single and multi-word cause and corresponding effect as expressed in diabetes-related tweets leveraging BERT-based architectures and visualised as cause-effect-network. Extracting causal associations on real-life, patient reported outcomes in social media data provides a useful complementary source of information in diabetes research.
Submitted 24 February, 2022; v1 submitted 1 November, 2021;
originally announced November 2021.
-
MIMICause: Representation and automatic extraction of causal relation types from clinical notes
Authors:
Vivek Khetan,
Md Imbesat Hassan Rizvi,
Jessica Huber,
Paige Bartusiak,
Bogdan Sacaleanu,
Andrew Fano
Abstract:
Understanding causal narratives communicated in clinical notes can help make strides towards personalized healthcare. Extracted causal information from clinical notes can be combined with structured EHR data such as patients' demographics, diagnoses, and medications. This will enhance healthcare providers' ability to identify aspects of a patient's story communicated in the clinical notes and help make more informed decisions.
In this work, we propose annotation guidelines, develop an annotated corpus and provide baseline scores to identify types and direction of causal relations between a pair of biomedical concepts in clinical notes; communicated implicitly or explicitly, identified either in a single sentence or across multiple sentences.
We annotate a total of 2714 de-identified examples sampled from the 2018 n2c2 shared task dataset and train four different language model based architectures. Annotation based on our guidelines achieved a high inter-annotator agreement, i.e., a Fleiss' kappa (κ) score of 0.72, and our model for identification of causal relations achieved a macro F1 score of 0.56 on the test data. The high inter-annotator agreement for clinical text shows the quality of our annotation guidelines while the provided baseline F1 score sets the direction for future research towards understanding narratives in clinical texts.
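As background on the agreement statistic reported above, here is a minimal sketch of the standard Fleiss' kappa computation. It is not the authors' code; the function name, the input layout (one list of labels per item, same number of raters for every item), and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`: a list of items, each a list of
    category labels with one label per rater (equal raters per item)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the per-item agreement P_i.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in proportions)

    # Undefined (division by zero) if every rating falls in a single category.
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: two items, three raters each, labels "x" and "y".
perfect = fleiss_kappa([["x", "x", "x"], ["y", "y", "y"]], ["x", "y"])
mixed = fleiss_kappa([["x", "x", "y"], ["x", "y", "y"]], ["x", "y"])
```

A value of 1 indicates perfect agreement and values at or below 0 indicate agreement no better than chance, which gives some context for the 0.72 reported in the abstract.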
Submitted 13 March, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Anycast Agility: Network Playbooks to Fight DDoS
Authors:
A S M Rizvi,
Leandro Bertholdo,
Joao Ceron,
John Heidemann
Abstract:
IP anycast is used for services such as DNS and Content Delivery Networks (CDN) to provide the capacity to handle Distributed Denial-of-Service (DDoS) attacks. During a DDoS attack service operators redistribute traffic between anycast sites to take advantage of sites with unused or greater capacity. Depending on site traffic and attack size, operators may instead concentrate attackers in a few sites to preserve operation in others. Operators use these actions during attacks, but how to do so has not been described systematically or publicly. This paper describes several methods to use BGP to shift traffic when under DDoS, and shows that a response playbook can provide a menu of responses that are options during an attack. To choose an appropriate response from this playbook, we also describe a new method to estimate true attack size, even though the operator's view during the attack is incomplete. Finally, operator choices are constrained by distributed routing policies, and not all are helpful. We explore how specific anycast deployment can constrain options in this playbook, and are the first to measure how generally applicable they are across multiple anycast networks.
Submitted 28 February, 2022; v1 submitted 24 June, 2020;
originally announced June 2020.