Search | arXiv e-print repository

Process Reward Models That Think

Authors: Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang

Abstract: Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models will be released at https://github.com/mukhal/thinkprm.

Submitted 23 June, 2025; v1 submitted 23 April, 2025; originally announced April 2025.
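
Illustration: the best-of-N selection mentioned in the abstract above can be sketched in a few lines. This is a minimal sketch, not the ThinkPRM implementation. The names `best_of_n` and `score` are hypothetical, and the scoring function stands in for whatever verifier assigns a reward to a candidate solution.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score.

    `score` is a hypothetical stand-in for a verifier (for example a PRM
    that maps a full solution to a scalar reward).
    """
    if not candidates:
        raise ValueError("need at least one candidate solution")
    # max() keeps the first candidate among ties, which is a simple default.
    return max(candidates, key=score)
```

A toy call such as `best_of_n(["a", "abc", "ab"], score=len)` selects `"abc"`; in practice the scorer would be a trained verifier rather than `len`.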

arXiv:2504.09702 [pdf, other]

MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Authors: Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang

Abstract: We introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions, with a focus on open research problems that demand novel methodologies. Unlike prior work, e.g., AI Scientist, which evaluates the end-to-end agentic pipeline by using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between the LLM-judged innovation and actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark, designed to grow with new ML competitions and encourage rigorous, objective evaluations of AI research capabilities. Our leaderboard and code are available at: https://huggingface.co/spaces/launch/MLRC_Bench

Submitted 18 May, 2025; v1 submitted 13 April, 2025; originally announced April 2025.

arXiv:2412.04144 [pdf, other]

If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs

Authors: Muhammad Khalifa, Yi-Chern Tan, Arash Ahmadian, Tom Hosking, Honglak Lee, Lu Wang, Ahmet Üstün, Tom Sherborne, Matthias Gallé

Abstract: Model merging has shown great promise at combining expert models, but the benefit of merging is unclear when merging "generalist" models trained on many tasks. We explore merging in the context of large (~100B) models, by recycling checkpoints that exhibit tradeoffs among different tasks. Such checkpoints are often created in the process of developing a frontier model, and the suboptimal ones are usually discarded. Given a pool of model checkpoints obtained from different training runs (e.g., different stages, objectives, hyperparameters, and data mixtures), which naturally show tradeoffs across different language capabilities (e.g., instruction following vs. code generation), we investigate whether merging can recycle such suboptimal models into a Pareto-optimal one. Our optimization algorithm tunes the weight of each checkpoint in a linear combination, resulting in such an optimal model that outperforms both individual models and merge-based baselines. Further analysis shows that good merges tend to include almost all checkpoints with non-zero weights, indicating that even seemingly bad initial checkpoints can contribute to good final merges.

Submitted 3 February, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

Comments: 13 pages, 9 figures
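
Illustration: the abstract above describes merging as a weighted linear combination of checkpoints. The sketch below shows only that combination step, under the assumption that each checkpoint is a dictionary mapping parameter names to flat lists of floats. The function name `merge_checkpoints` is hypothetical, and the search that tunes the weights is not shown.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]


def merge_checkpoints(checkpoints: List[Checkpoint], weights: List[float]) -> Checkpoint:
    """Combine checkpoints parameter-wise as sum_i weights[i] * checkpoints[i]."""
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    merged: Checkpoint = {}
    for name, reference in checkpoints[0].items():
        # Every checkpoint is assumed to share the same parameter names and sizes.
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(reference))
        ]
    return merged
```

With equal weights of 0.5, merging `{"w": [1.0, 2.0]}` and `{"w": [3.0, 4.0]}` gives `{"w": [2.0, 3.0]}`, i.e. a plain average.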

arXiv:2411.07130 [pdf, ps, other]

On Many-Shot In-Context Learning for Long-Context Evaluation

Authors: Kaijian Zou, Muhammad Khalifa, Lu Wang

Abstract: Many-shot in-context learning (ICL) has emerged as a unique setup to both utilize and test the ability of large language models to handle long context. This paper delves into long-context language model (LCLM) evaluation through many-shot ICL. We first ask: what types of ICL tasks benefit from additional demonstrations, and how effective are they in evaluating LCLMs? We find that classification and summarization tasks show performance improvements with additional demonstrations, while translation and reasoning tasks do not exhibit clear trends. Next, we investigate the extent to which different tasks necessitate retrieval versus global context understanding. We develop metrics to categorize ICL tasks into two groups: (i) similar-sample learning (SSL): tasks where retrieval of the most similar examples is sufficient for good performance, and (ii) all-sample learning (ASL): tasks that necessitate a deeper comprehension of all examples in the prompt. Lastly, we introduce a new many-shot ICL benchmark, MANYICLBENCH, to characterize a model's ability on both fronts and benchmark 12 LCLMs using MANYICLBENCH. We find that while state-of-the-art models demonstrate good performance up to 64k tokens in SSL tasks, many models experience significant performance drops at only 16k tokens in ASL tasks.

Submitted 12 June, 2025; v1 submitted 11 November, 2024; originally announced November 2024.

Comments: ACL 2025 Main Conference

arXiv:2410.02899 [pdf, ps, other]

FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs

Authors: Deema Alnuhait, Neeraja Kirtane, Muhammad Khalifa, Hao Peng

Abstract: Language models (LMs) hallucinate. We inquire: Can we detect and mitigate hallucinations before they happen? This work answers this research question in the positive, by showing that the internal representations of LMs provide rich signals that can be used for this purpose. We introduce FactCheckmate, which preemptively detects hallucinations by learning a classifier that predicts whether the LM will hallucinate, based on the model's hidden states produced over the inputs, before decoding begins. If a hallucination is detected, FactCheckmate then intervenes by adjusting the LM's hidden states such that the model will produce more factual outputs. FactCheckmate provides fresh insights that the inner workings of LMs can be revealed by their hidden states. Practically, both its detection and mitigation models are lightweight, adding little inference overhead; FactCheckmate proves a more efficient approach for mitigating hallucinations compared to many post-hoc alternatives. We evaluate FactCheckmate over LMs of different scales and model families (including Llama, Mistral, Qwen and Gemma), across a variety of QA datasets from different domains. Our results demonstrate the effectiveness of FactCheckmate, achieving over 70% preemptive detection accuracy. On average, outputs generated by LMs with intervention are 34.4% more factual compared to those without.

Submitted 24 June, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

arXiv:2405.16337 [pdf, other]

Learning to Reason via Program Generation, Emulation, and Search

Authors: Nathaniel Weir, Muhammad Khalifa, Linlu Qiu, Orion Weller, Peter Clark

Abstract: Program synthesis with language models (LMs) has unlocked a large set of reasoning abilities; code-tuned LMs have proven adept at generating programs that solve a wide variety of algorithmic symbolic manipulation tasks (e.g. word concatenation). However, not all reasoning tasks are easily expressible as code, e.g. tasks involving commonsense reasoning, moral decision-making, and sarcasm understanding. Our goal is to extend an LM's program synthesis skills to such tasks and evaluate the results via pseudo-programs, namely Python programs where some leaf function calls are left undefined. To that end, we propose Code Generation and Emulated EXecution (CoGEX). CoGEX works by (1) training LMs to generate pseudo-programs, (2) teaching them to emulate their generated program's execution, including those leaf functions, allowing the LM's knowledge to fill in the execution gaps; and (3) using them to search over many programs to find an optimal one. To adapt the CoGEX model to a new task, we introduce a method for performing program search to find a single program whose pseudo-execution yields optimal performance when applied to all the instances of a given dataset. We show that our approach yields large improvements compared to standard in-context learning approaches on a battery of tasks, both algorithmic and soft reasoning. This result thus demonstrates that code synthesis can be applied to a much broader class of problems than previously considered. Our released dataset, fine-tuned models, and implementation can be found at https://github.com/nweir127/CoGEX.

Submitted 3 November, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

Comments: NeurIPS 2024 camera ready

arXiv:2404.17140 [pdf, other]

Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Authors: Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang

Abstract: Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs), where LLMs refine their solutions using self-generated critiques that pinpoint the errors. This work explores whether small (<= 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs. We propose a novel pipeline that prompts smaller LMs to collect self-correction data that supports the training of self-refinement abilities. First, we leverage correct solutions to guide the model in critiquing their incorrect responses. Second, the generated critiques, after filtering, are used for supervised fine-tuning of the self-correcting reasoner through solution refinement. Our experimental results show improved self-correction abilities of two models on five datasets spanning math and commonsense reasoning, with notable performance gains when paired with a strong GPT-4-based verifier, though limitations are identified when using a weak self-verifier for determining when to correct.

Submitted 5 June, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

Comments: ACL Findings 2024 - Camera Ready

arXiv:2404.01019 [pdf, other]

Source-Aware Training Enables Knowledge Attribution in Language Models

Authors: Muhammad Khalifa, David Wadden, Emma Strubell, Honglak Lee, Lu Wang, Iz Beltagy, Hao Peng

Abstract: Large language models (LLMs) learn a vast amount of knowledge during pretraining, but they are often oblivious to the source(s) of such knowledge. We investigate the problem of intrinsic source citation, where LLMs are required to cite the pretraining source supporting a generated response. Intrinsic source citation can enhance LLM transparency, interpretability, and verifiability. To give LLMs such ability, we explore source-aware training -- a recipe that involves (i) training the LLM to associate unique source document identifiers with the knowledge in each document, followed by (ii) an instruction-tuning stage to teach the LLM to cite a supporting pretraining source when prompted. Source-aware training borrows from existing pretraining/fine-tuning frameworks and requires minimal changes to the model architecture or implementation. Through experiments on synthetic data, we demonstrate that our training recipe can enable faithful attribution to the pretraining data without a substantial impact on the model's perplexity compared to standard pretraining. Our findings also highlight the importance of pretraining data augmentation in achieving attribution. Code and data available here: https://github.com/mukhal/intrinsic-source-citation

Submitted 12 August, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: COLM '24

arXiv:2310.19208 [pdf, other]

LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses

Authors: Xin Liu, Muhammad Khalifa, Lu Wang

Abstract: A model is considered well-calibrated when its probability estimate aligns with the actual likelihood of the output being correct. Calibrating language models (LMs) is crucial, as it plays a vital role in detecting and mitigating hallucinations of LMs as well as building more trustworthy models. However, standard calibration techniques may not be suited for LM calibration. For instance, post-processing methods such as temperature scaling do not reorder the candidate generations. On the other hand, training-based methods require fine-tuning the entire model, which is impractical for LMs of large scale. We present LitCab, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and predicts a bias term, which is then added to the LM output logits. LitCab improves model calibration by only adding < 2% of the original model parameters. For evaluation, we construct CaT, a benchmark consisting of eight text generation tasks, covering responses ranging from short phrases to paragraphs. We test LitCab with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by as much as 30%. We further conduct a comprehensive evaluation with multiple popular open-sourced LMs from GPT and LLaMA families, yielding the following key findings: (i) Larger models within the same family exhibit better calibration on short-generation tasks, but not necessarily on longer ones. (ii) GPT-family models show superior calibration compared to LLaMA, Llama2, and Vicuna models, despite having far fewer parameters. (iii) Fine-tuning a pretrained model (e.g., LLaMA) with samples of limited purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of fine-tuning setups for calibrating LMs.

Submitted 13 March, 2024; v1 submitted 29 October, 2023; originally announced October 2023.

Comments: accepted to ICLR 2024
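
Illustration: the LitCab abstract above describes a single linear layer that maps the input text representation to a bias term added to the LM output logits. The sketch below shows that arithmetic only, assuming the bias has one entry per logit. The function name `calibrated_logits` and the parameters `weight` and `offset` are hypothetical, and how the layer is trained is not shown.

```python
from typing import List


def calibrated_logits(
    hidden: List[float],
    logits: List[float],
    weight: List[List[float]],
    offset: List[float],
) -> List[float]:
    """Add a linearly predicted bias to the logits.

    bias[j] = sum_k weight[j][k] * hidden[k] + offset[j]
    Assumption: `weight` has one row per logit and one column per hidden unit.
    """
    if len(weight) != len(logits) or len(offset) != len(logits):
        raise ValueError("bias layer must produce one value per logit")
    bias = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, offset)
    ]
    return [logit + b for logit, b in zip(logits, bias)]
```

Because the bias depends on the input representation rather than being one global scalar, it can change the relative order of candidate outputs, which a single temperature cannot do.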

arXiv:2310.14393 [pdf, other]

Merging Generated and Retrieved Knowledge for Open-Domain QA

Authors: Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Lu Wang

Abstract: Open-domain question answering (QA) systems are often built with retrieval modules. However, retrieving passages from a given source is known to suffer from insufficient knowledge coverage. Alternatively, prompting large language models (LLMs) to generate contextual passages based on their parametric knowledge has been shown to improve QA performance. Yet, LLMs tend to "hallucinate" content that conflicts with the retrieved knowledge. Based on the intuition that answers supported by both sources are more likely to be correct, we propose COMBO, a Compatibility-Oriented knowledge Merging for Better Open-domain QA framework, to effectively leverage the two sources of information. Concretely, we match LLM-generated passages with retrieved counterparts into compatible pairs, based on discriminators trained with silver compatibility labels. Then a Fusion-in-Decoder-based reader model handles passage pairs to arrive at the final answer. Experiments show that COMBO outperforms competitive baselines on three out of four tested open-domain QA benchmarks. Further analysis reveals that our proposed framework demonstrates greater efficacy in scenarios with a higher degree of knowledge conflicts.

Submitted 22 October, 2023; originally announced October 2023.

Comments: EMNLP 2023 - Camera Ready

arXiv:2308.08780 [pdf, other]

Exploring Demonstration Ensembling for In-context Learning

Authors: Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Lu Wang

Abstract: In-context learning (ICL) operates by showing language models (LMs) examples of input-output pairs for a given task, i.e., demonstrations. The standard approach for ICL is to prompt the LM with concatenated demonstrations followed by the test input. This approach suffers from some issues. First, concatenation offers almost no control over the contribution of each demo to the model prediction. This can be sub-optimal when some demonstrations are irrelevant to the test example. Second, due to the input length limit of some transformer models, it might be infeasible to fit many examples into the context, especially when dealing with long-input tasks. In this work, we explore Demonstration Ensembling (DENSE) as an alternative to simple concatenation. DENSE predicts outputs using subsets (i.e., buckets) of the demonstrations and then combines the output probabilities resulting from each subset to produce the final prediction. We study different ensembling methods using GPT-J and experiment on 12 language tasks. Our experiments show weighted max ensembling to outperform vanilla concatenation by as much as 2.4 average points. Code available at https://github.com/mukhal/icl-ensembling.

Submitted 20 August, 2023; v1 submitted 17 August, 2023; originally announced August 2023.

Comments: Published at ME-FoMo workshop at ICLR 2023. Arxiv version includes evaluation on 5 more tasks
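
Illustration: the DENSE abstract above combines per-bucket output probabilities into one prediction. The sketch below shows two simple combiners over label distributions, a weighted mean and a weighted max. These are one plausible reading of such ensembling and may differ from the paper's exact definitions. The function name `ensemble_predict` and the bucket weights are hypothetical.

```python
from typing import Dict, List


def ensemble_predict(
    bucket_probs: List[Dict[str, float]],
    weights: List[float],
    mode: str = "mean",
) -> str:
    """Combine per-bucket label probabilities and return the top label.

    bucket_probs[i][y] is the probability of label y when the model is
    prompted with demonstration bucket i; weights[i] is that bucket's weight.
    """
    if not bucket_probs or len(bucket_probs) != len(weights):
        raise ValueError("need one weight per bucket")
    labels = list(bucket_probs[0])
    if mode == "mean":
        total = sum(weights)
        scores = {
            y: sum(w * p[y] for w, p in zip(weights, bucket_probs)) / total
            for y in labels
        }
    elif mode == "max":
        # Assumed form of "weighted max": best weighted probability per label.
        scores = {y: max(w * p[y] for w, p in zip(weights, bucket_probs)) for y in labels}
    else:
        raise ValueError(f"unknown mode: {mode}")
    return max(scores, key=scores.get)
```

The two combiners can disagree: with uniform weights and buckets giving "pos" probabilities 0.9, 0.2, and 0.3, the mean favors "neg" while the max favors "pos", because a single confident bucket dominates under max.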

arXiv:2305.15629 [pdf, other]

Patient Outcome Predictions Improve Operations at a Large Hospital Network

Authors: Liangyuan Na, Kimberly Villalobos Carballo, Jean Pauphilet, Ali Haddad-Sisakht, Daniel Kombert, Melissa Boisjoli-Langlois, Andrew Castiglione, Maram Khalifa, Pooja Hebbal, Barry Stein, Dimitris Bertsimas

Abstract: Problem definition: Access to accurate predictions of patients' outcomes can enhance medical staff's decision-making, which ultimately benefits all stakeholders in the hospitals. A large hospital network in the US has been collaborating with academics and consultants to predict short-term and long-term outcomes for all inpatients across their seven hospitals. Methodology/results: We develop machine learning models that predict the probabilities of next 24-hr/48-hr discharge and intensive care unit transfers, end-of-stay mortality and discharge dispositions. All models achieve high out-of-sample AUC (75.7%-92.5%) and are well calibrated. In addition, combining 48-hr discharge predictions with doctors' predictions simultaneously enables more patient discharges (10%-28.7%) and fewer 7-day/30-day readmissions (p-value < 0.001). We implement an automated pipeline that extracts data and updates predictions every morning, as well as user-friendly software and a color-coded alert system to communicate these patient-level predictions (alongside explanations) to clinical teams. Managerial implications: Since we have been gradually deploying the tool, and training medical staff, over 200 doctors, nurses, and case managers across seven hospitals use it in their daily patient review process. We observe a significant reduction in the average length of stay (0.67 days per patient) following its adoption and anticipate substantial financial benefits (between $55 and $72 million annually) for the healthcare system.

Submitted 24 May, 2023; originally announced May 2023.

Comments: 41 pages, 13 figures

arXiv:2305.14934 [pdf, other]

GRACE: Discriminator-Guided Chain-of-Thought Reasoning

Authors: Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Lu Wang

Abstract: In the context of multi-step reasoning, e.g., with chain-of-thought, language models (LMs) can easily assign a high likelihood to incorrect steps. As a result, decoding strategies that optimize for solution likelihood often yield incorrect solutions. To address this issue, we propose Guiding chain-of-thought ReAsoning with a CorrectnEss Discriminator (GRACE), a stepwise decoding approach that steers the decoding process towards producing correct reasoning steps. GRACE employs a discriminator trained with a contrastive loss over correct and incorrect steps, which is used during decoding to score next-step candidates based on their correctness. Importantly, GRACE only requires sampling from the LM, without the need for LM training or fine-tuning. Using models from FLAN-T5 and LLaMA families, we evaluate GRACE over four math and two symbolic reasoning tasks, where it exhibits substantial performance gains compared to greedy decoding, verifiers, and self-consistency in most settings. When further combined with self-consistency, GRACE outperforms all the baselines by sizeable margins. Human and LLM evaluations over GSM8K show that GRACE not only improves the final answer accuracy but also the correctness of the intermediate reasoning. Our implementation can be accessed at https://github.com/mukhal/grace.

Submitted 23 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: To appear at Findings of EMNLP 2023

arXiv:2305.12544 [pdf, other]

Has It All Been Solved? Open NLP Research Questions Not Solved by Large Language Models

Authors: Oana Ignat, Zhijing Jin, Artem Abzaliev, Laura Biester, Santiago Castro, Naihao Deng, Xinyi Gao, Aylin Gunal, Jacky He, Ashkan Kazemi, Muhammad Khalifa, Namho Koh, Andrew Lee, Siyang Liu, Do June Min, Shinka Mori, Joan Nwatu, Veronica Perez-Rosas, Siqi Shen, Zekun Wang, Winston Wu, Rada Mihalcea

Abstract: Recent progress in large language models (LLMs) has enabled the deployment of many generative NLP applications. At the same time, it has also led to a misleading public discourse that "it's all been solved." Not surprisingly, this has, in turn, made many NLP researchers -- especially those at the beginning of their careers -- worry about what NLP research area they should focus on. Has it all been solved, or what remaining questions can we work on regardless of LLMs? To address this question, this paper compiles NLP research directions rich for exploration. We identify fourteen different research areas encompassing 45 research directions that require new research and are not directly solvable by LLMs. While we identify many research areas, many others exist; we do not cover areas currently addressed by LLMs, but where LLMs lag behind in performance or those focused on LLM development. We welcome suggestions for other research directions to include: https://bit.ly/nlp-era-llm

Submitted 15 March, 2024; v1 submitted 21 May, 2023; originally announced May 2023.

Comments: Accepted at COLING 2024

arXiv:2305.12018 [pdf, other]

BOLT: Fast Energy-based Controlled Text Generation with Tunable Biases

Authors: Xin Liu, Muhammad Khalifa, Lu Wang

Abstract: Energy-based models (EBMs) have gained popularity for controlled text generation due to their high applicability to a wide range of constraints. However, sampling from EBMs is non-trivial, as it often requires a large number of iterations to converge to plausible text, which slows down the decoding process and makes it less practical for real-world applications. In this work, we propose BOLT, which relies on tunable biases to directly adjust the language model's output logits. Unlike prior work, BOLT maintains the generator's autoregressive nature to assert a strong control on token-wise conditional dependencies and overall fluency, and thus converges faster. When compared with state-of-the-art methods on controlled generation tasks using both soft constraints (e.g., sentiment control) and hard constraints (e.g., keyword-guided topic control), BOLT demonstrates significantly improved efficiency and fluency. On sentiment control, BOLT is 7x faster than competitive baselines, and more fluent in 74.4% of the evaluation samples according to human judges.

Submitted 19 May, 2023; originally announced May 2023.

Comments: Accepted by ACL 2023

arXiv:2302.08284 [pdf, other]

ClaPIM: Scalable Sequence CLAssification using Processing-In-Memory

Authors: Marcel Khalifa, Barak Hoffer, Orian Leitersdorf, Robert Hanhan, Ben Perach, Leonid Yavits, Shahar Kvatinsky

Abstract: DNA sequence classification is a fundamental task in computational biology with vast implications for applications such as disease prevention and drug design. Therefore, fast high-quality sequence classifiers are significantly important. This paper introduces ClaPIM, a scalable DNA sequence classification architecture based on the emerging concept of hybrid in-crossbar and near-crossbar memristive processing-in-memory (PIM). We enable efficient and high-quality classification by uniting the filter and search stages within a single algorithm. Specifically, we propose a custom filtering technique that drastically narrows the search space and a search approach that facilitates approximate string matching through a distance function. ClaPIM is the first PIM architecture for scalable approximate string matching that benefits from the high density of memristive crossbar arrays and the massive computational parallelism of PIM. Compared with Kraken2, a state-of-the-art software classifier, ClaPIM provides significantly higher classification quality (up to 20x improvement in F1 score) and also demonstrates a 1.8x throughput improvement. Compared with EDAM, a recently-proposed SRAM-based accelerator that is restricted to small datasets, we observe both a 30.4x improvement in normalized throughput per area and a 7% increase in classification precision.

Submitted 5 November, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

arXiv:2211.04903 [pdf, other]

Novel Chapter Abstractive Summarization using Spinal Tree Aware Sub-Sentential Content Selection

Authors: Hardy Hardy, Miguel Ballesteros, Faisal Ladhak, Muhammad Khalifa, Vittorio Castelli, Kathleen McKeown

Abstract: Summarizing novel chapters is a difficult task due to the input length and the fact that sentences that appear in the desired summaries draw content from multiple places throughout the chapter. We present a pipelined extractive-abstractive approach where the extractive step filters the content that is passed to the abstractive component. Extremely lengthy input also results in a highly skewed dataset towards negative instances for extractive summarization; we thus adopt a margin ranking loss for extraction to encourage separation between positive and negative examples. Our extraction component operates at the constituent level; our approach to this problem enriches the text with spinal tree information which provides syntactic context (in the form of constituents) to the extraction model. We show an improvement of 3.71 Rouge-1 points over best results reported in prior work on an existing novel chapter dataset.

Submitted 9 November, 2022; originally announced November 2022.

arXiv:2210.05613 [pdf, other]

Contrastive Training Improves Zero-Shot Classification of Semi-structured Documents

Authors: Muhammad Khalifa, Yogarshi Vyas, Shuai Wang, Graham Horwood, Sunil Mallya, Miguel Ballesteros

Abstract: We investigate semi-structured document classification in a zero-shot setting. Classification of semi-structured documents is more challenging than that of standard unstructured documents, as positional, layout, and style information play a vital role in interpreting such documents. The standard classification setting where categories are fixed during both training and testing falls short in dynamic environments where new document categories could potentially emerge. We focus exclusively on the zero-shot setting where inference is done on new unseen classes. To address this task, we propose a matching-based approach that relies on a pairwise contrastive objective for both pretraining and fine-tuning. Our results show a significant boost in Macro F$_1$ from the proposed pretraining step in both supervised and unsupervised zero-shot settings.

Submitted 11 October, 2022; originally announced October 2022.

arXiv:2205.12650 [pdf, other]

Few-shot Reranking for Multi-hop QA via Language Model Prompting

Authors: Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Lu Wang

Abstract: We study few-shot reranking for multi-hop QA with open-domain questions. To alleviate the need for a large number of labeled question-document pairs for retriever training, we propose PromptRank, which relies on large language models prompting for multi-hop path reranking. PromptRank first constructs an instruction-based prompt that includes a candidate document path and then computes the relevance score between a given question and the path based on the conditional likelihood of the question given the path prompt according to a language model. PromptRank yields strong retrieval performance on HotpotQA with only 128 training examples compared to state-of-the-art methods trained on thousands of examples -- 73.6 recall@10 by PromptRank vs. 77.8 by PathRetriever and 77.5 by multi-hop dense retrieval. Code available at https://github.com/mukhal/PromptRank

Submitted 2 July, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

Comments: ACL 2023 - Camera Ready

arXiv:2201.04205 [pdf, other]

JSOL: JavaScript Open-source Library for Grammar of Graphics

Authors: Waleed A. Yousef, Hisham E. Mohammed, Andrew A. Naguib, Rafat S. Eid, Sherif E. Emabrak, Ahmed F. Hamed, Yusuf M. Khalifa, Shrouk T. AbdElrheem, Eman A. Awad, Sara G. Gaafar, Alaa M. Mamdoh, Nada A. Shawky

Abstract: In this paper, we introduce the JavaScript Open-source Library (JSOL), a high-level grammar for representing data in visualization graphs and plots. JSOL's perspective on the grammar of graphics is unique; it provides state-of-the-art rules for encoding visual primitives that can be used to generate a known scene or to invent a new one. JSOL has ton rules developed specifically for data-munging, mapping, and visualization through many layers, such as algebra, scales, and geometries. Additionally, it has a compiler that incorporates and combines all rules specified by a user and puts them in a flow to validate it as a visualization grammar and check its requisites. Users can customize scenes through a pipeline that either puts customized rules or comes with new ones. We evaluated JSOL on a multitude of plots to check rules specification of customizing a specific plot. Although the project is still under development and many enhancements are under construction, this paper describes the first developed version of JSOL, circa 2016, where an open-source version of it is available. One immediate practical deployment for JSOL is to be integrated with the open-source version of the Data Visualization Platform (DVP) \citep{Yousef2019DVP-arxiv}

Submitted 11 January, 2022; originally announced January 2022.

arXiv:2109.08232 [pdf, other]

A Bag of Tricks for Dialogue Summarization

Authors: Muhammad Khalifa, Miguel Ballesteros, Kathleen McKeown

Abstract: Dialogue summarization comes with its own peculiar challenges as opposed to news or scientific articles summarization. In this work, we explore four different challenges of the task: handling and differentiating parts of the dialogue belonging to multiple speakers, negation understanding, reasoning about the situation, and informal language understanding. Using a pretrained sequence-to-sequence language model, we explore speaker name substitution, negation scope highlighting, multi-task learning with relevant tasks, and pretraining on in-domain data. Our experiments show that our proposed techniques indeed improve summarization performance, outperforming strong baselines.

Submitted 16 September, 2021; originally announced September 2021.

Comments: EMNLP 2021 - short paper

arXiv:2104.06591 [pdf, other]

Zero-Resource Multi-Dialectal Arabic Natural Language Understanding

Authors: Muhammad Khalifa, Hesham Hassan, Aly Fahmy

Abstract: A reasonable amount of annotated data is required for fine-tuning pre-trained language models (PLM) on downstream tasks. However, obtaining labeled examples for different language varieties can be costly. In this paper, we investigate the zero-shot performance on Dialectal Arabic (DA) when fine-tuning a PLM on modern standard Arabic (MSA) data only -- identifying a significant performance drop when evaluating such models on DA. To remedy such performance drop, we propose self-training with unlabeled DA data and apply it in the context of named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm detection (SRD) on several DA varieties. Our results demonstrate the effectiveness of self-training with unlabeled DA data: improving zero-shot MSA-to-DA transfer by as large as $\sim$10% F$_1$ (NER), 2% accuracy (POS tagging), and 4.5% F$_1$ (SRD). We conduct an ablation experiment and show that the performance boost observed directly results from the unlabeled DA examples used for self-training. Our work opens up opportunities for leveraging the relatively abundant labeled MSA datasets to develop DA models for zero and low-resource dialects. We also report new state-of-the-art performance on all three tasks and open-source our fine-tuned models for the research community.

Submitted 25 May, 2022; v1 submitted 13 April, 2021; originally announced April 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2101.04758

arXiv:2101.04758 [pdf, other]

Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal Arabic Sequence Labeling

Authors: Muhammad Khalifa, Muhammad Abdul-Mageed, Khaled Shaalan

Abstract: A sufficient amount of annotated data is usually required to fine-tune pre-trained language models for downstream tasks. Unfortunately, attaining labeled data can be costly, especially for multiple language varieties and dialects. We propose to self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce varieties using only resources from data-rich ones. We demonstrate the utility of our approach in the context of Arabic sequence labeling by using a language model fine-tuned on Modern Standard Arabic (MSA) only to predict named entities (NE) and part-of-speech (POS) tags on several dialectal Arabic (DA) varieties. We show that self-training is indeed powerful, improving zero-shot MSA-to-DA transfer by as large as $\sim$10% F$_1$ (NER) and 2% accuracy (POS tagging). We acquire even better performance in few-shot scenarios with limited amounts of labeled data. We conduct an ablation study and show that the performance boost observed directly results from the unlabeled DA examples used for self-training. Our work opens up opportunities for developing DA models exploiting only MSA resources and it can be extended to other languages and tasks. Our code and fine-tuned models can be accessed at https://github.com/mohammadKhalifa/zero-shot-arabic-dialects.

Submitted 2 February, 2021; v1 submitted 12 January, 2021; originally announced January 2021.

Comments: Accepted at EACL 2021 (Camera Ready Version)

arXiv:2012.11635 [pdf, other]

A Distributional Approach to Controlled Text Generation

Authors: Muhammad Khalifa, Hady Elsahar, Marc Dymetman

Abstract: We propose a Distributional Approach for addressing Controlled Text Generation from pre-trained Language Models (LMs). This approach permits to specify, in a single formal framework, both "pointwise" and "distributional" constraints over the target LM -- to our knowledge, the first model with such generality -- while minimizing KL divergence from the initial LM distribution. The optimal target distribution is then uniquely determined as an explicit EBM (Energy-Based Model) representation. From that optimal representation we then train a target controlled Autoregressive LM through an adaptive distributional variant of Policy Gradient. We conduct a first set of experiments over pointwise constraints showing the advantages of our approach over a set of baselines, in terms of obtaining a controlled LM balancing constraint satisfaction with divergence from the initial LM. We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models. Through an ablation study, we show the effectiveness of our adaptive technique for obtaining faster convergence. (Code available at https://github.com/naver/gdc)

Submitted 6 May, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

Comments: ICLR 2021 camera-ready version

arXiv:2012.00600 [pdf]

Extracting Synonyms from Bilingual Dictionaries

Authors: Mustafa Jarrar, Eman Karajah, Muhammad Khalifa, Khaled Shaalan

Abstract: We present our progress in developing a novel algorithm to extract synonyms from bilingual dictionaries. Identification and usage of synonyms play a significant role in improving the performance of information access applications. The idea is to construct a translation graph from translation pairs, then to extract and consolidate cyclic paths to form bilingual sets of synonyms. The initial evaluation of this algorithm illustrates promising results in extracting Arabic-English bilingual synonyms. In the evaluation, we first converted the synsets in the Arabic WordNet into translation pairs (i.e., losing word-sense memberships). Next, we applied our algorithm to rebuild these synsets. We compared the original and extracted synsets obtaining an F-Measure of 82.3% and 82.1% for Arabic and English synsets extraction, respectively.

Submitted 1 December, 2020; originally announced December 2020.

Comments: In Proceedings - 11th International Global Wordnet Conference (GWC2021). Global Wordnet Association (2021)

Journal ref: In Proceedings of the 11th International Global Wordnet Conference (GWC2021). (pp. 215-222). Global Wordnet Association. (2021)

arXiv:2011.10255 [pdf]

A lightweight cryptography (LWC) framework to secure memory heap in Internet of Things

Authors: Mahmoud Khalifa, Fahad Algarni, Mohammad Ayoub Khan, Azmat Ullah, Khalid Aloufic

Abstract: The extensive networking of devices and the large amount of data generated from the Internet of Things (IoT) has brought security issues to the attention of the researcher. Java is the most common platform for embedded applications such as IoT, Wireless Sensors Networks (WSN), Near Field Communications (NFC) and Radio Frequency Identification (RFID). The object programming languages such as Java, SWIFT, PHP and C++ use garbage collection after any object run which creates security loophole for attacks such as Next Memory Address Occupation (NMAO), memory replay, Learning Tasks Behaviors (LTB). The security risk increases in IoT when attacks exceeds the target device to the surrounding connected devices. Inappropriate or wrong operations causes energy loss and increased costs. In this paper, a security method to protect IoT system operation from memory heap penetration and address modification attack is proposed. The proposed method prevents directed attack by encrypting the object Garbage Collection at run time. To form a unique signature mechanism, the Cryptographic Hash Function (CHF) which employs a specific one-way hash algorithm. The proposed framework uses L-function based ECC and one-time Key (OTK) to secure the memory heap. Our method is used with open system where the effect on the operating system is not considered. The proposed method proved to be powerful and efficient which can help in achieving higher levels of security across several IoT applications, by enabling better detection of malicious attacks.

Submitted 20 November, 2020; originally announced November 2020.

Comments: Alexandria Engineering Journal

arXiv:2007.11073 [pdf, other]

Book Success Prediction with Pretrained Sentence Embeddings and Readability Scores

Authors: Muhammad Khalifa, Aminul Islam

Abstract: Predicting the potential success of a book in advance is vital in many applications. This could help both publishers and readers in their decision-making process whether or not a book is worth publishing and reading, respectively. In this paper, we propose a model that leverages pretrained sentence embeddings along with various readability scores for book success prediction. Unlike previous methods, the proposed method requires no count-based, lexical, or syntactic features. Instead, we use a convolutional neural network over pretrained sentence embeddings and leverage different readability scores through a simple concatenation operation. Our proposed model outperforms strong baselines for this task by as large as 6.4% F1-score points. Moreover, our experiments show that according to our model, only the first 1K sentences are good enough to predict the potential success of books.

Submitted 5 October, 2021; v1 submitted 21 July, 2020; originally announced July 2020.

Comments: To Appear at HICSS-55

arXiv:2004.01184 [pdf]

Detection of Coronavirus (COVID-19) Associated Pneumonia based on Generative Adversarial Networks and a Fine-Tuned Deep Transfer Learning Model using Chest X-ray Dataset

Authors: Nour Eldeen M. Khalifa, Mohamed Hamed N. Taha, Aboul Ella Hassanien, Sally Elghamrawy

Abstract: The COVID-19 coronavirus is one of the devastating viruses according to the world health organization. This novel virus leads to pneumonia, which is an infection that inflames the lungs' air sacs of a human. One of the methods to detect those inflames is by using x-rays for the chest. In this paper, a pneumonia chest x-ray detection based on generative adversarial networks (GAN) with a fine-tuned deep transfer learning for a limited dataset will be presented. The use of GAN positively affects the proposed model robustness and made it immune to the overfitting problem and helps in generating more images from the dataset. The dataset used in this research consists of 5863 X-ray images with two categories: Normal and Pneumonia. This research uses only 10% of the dataset for training data and generates 90% of images using GAN to prove the efficiency of the proposed model. Through the paper, AlexNet, GoogLeNet, Squeeznet, and Resnet18 are selected as deep transfer learning models to detect the pneumonia from chest x-rays. Those models are selected based on their small number of layers on their architectures, which will reflect in reducing the complexity of the models and the consumed memory and time. Using a combination of GAN and deep transfer models proved its efficiency according to testing accuracy measurement.
The research concludes that the Resnet18 is the most appropriate deep transfer model according to testing accuracy measurement and achieved 99% with the other performance metrics such as precision, recall, and F1 score while using GAN as an image augmenter. Finally, a comparison result was carried out at the end of the research with related work which used the same dataset except that this research used only 10% of original dataset. The presented work achieved a superior result than the related work in terms of testing accuracy.

Submitted 2 April, 2020; originally announced April 2020.

Comments: 15 pages, 3 Tables and 10 Figures

arXiv:1910.05983 [pdf, other]

On the Reduction of Variance and Overestimation of Deep Q-Learning

Authors: Mohammed Sabry, Amr M. A. Khalifa

Abstract: The breakthrough of deep Q-Learning on different types of environments revolutionized the algorithmic design of Reinforcement Learning to introduce more stable and robust algorithms, to that end many extensions to deep Q-Learning algorithm have been proposed to reduce the variance of the target values and the overestimation phenomena. In this paper, we examine new methodology to solve these issues, we propose using Dropout techniques on deep Q-Learning algorithm as a way to reduce variance and overestimation. We also present experiments conducted on benchmark environments, demonstrating the effectiveness of our methodology in enhancing stability and reducing both variance and overestimation in model performance.

Submitted 14 April, 2024; v1 submitted 14 October, 2019; originally announced October 2019.

arXiv:1908.06738

Semantic Source Code Search: A Study of the Past and a Glimpse at the Future

Authors: Muhammad Khalifa

Abstract: With the recent explosion in the size and complexity of source codebases and software projects, the need for efficient source code search engines has increased dramatically. Unfortunately, existing information retrieval-based methods fail to capture the query semantics and perform well only when the query contains syntax-based keywords. Consequently, such methods will perform poorly when given high-level natural language queries. In this paper, we review existing methods for building code search engines. We also outline the open research directions and the various obstacles that stand in the way of having a universal source code search engine.

Submitted 23 September, 2021; v1 submitted 15 August, 2019; originally announced August 2019.

Comments: The paper is outdated as there have been new methods and I have little time to work on it

arXiv:1908.02300 [pdf, other]

doi 10.1109/JBHI.2019.2933773

Relative Afferent Pupillary Defect Screening through Transfer Learning

Authors: Dogancan Temel, Melvin J. Mathew, Ghassan AlRegib, Yousuf M. Khalifa

Abstract: Abnormalities in pupillary light reflex can indicate optic nerve disorders that may lead to permanent visual loss if not diagnosed in an early stage. In this study, we focus on relative afferent pupillary defect (RAPD), which is based on the difference between the reactions of the eyes when they are exposed to light stimuli. Incumbent RAPD assessment methods are based on subjective practices that can lead to unreliable measurements. To eliminate subjectivity and obtain reliable measurements, we introduced an automated framework to detect RAPD. For validation, we conducted a clinical study with lab-on-a-headset, which can perform automated light reflex test. In addition to benchmarking handcrafted algorithms, we proposed a transfer learning-based approach that transformed a deep learning-based generic object recognition algorithm into a pupil detector. Based on the conducted experiments, proposed algorithm RAPDNet can achieve a sensitivity and a specificity of 90.6% over 64 test cases in a balanced set, which corresponds to an AUC of 0.929 in ROC analysis. According to our benchmark with three handcrafted algorithms and nine performance metrics, RAPDNet outperforms all other algorithms in every performance category.

Submitted 6 August, 2019; originally announced August 2019.

Comments: 8 pages, 7 figures, 4 tables. IEEE Journal of Biomedical and Health Informatics, 2019

ACM Class: I.4

arXiv:1907.11524 [pdf]

Validating and Updating GRASP: A New Evidence-Based Framework for Grading and Assessment of Clinical Predictive Tools

Authors: Mohamed Khalifa, Farah Magrabi, Blanca Gallego

Abstract: Background: When selecting predictive tools, for implementation in clinical practice or for recommendation in guidelines, clinicians are challenged with an overwhelming and ever-growing number of tools. Many of these have never been implemented or evaluated for comparative effectiveness. The authors developed an evidence-based framework for grading and assessment of predictive tools (GRASP), based on critical appraisal of published evidence. The objective of this study is to validate, update GRASP, and evaluate its reliability. Methods: We aimed at validating and updating GRASP through surveying a wide international group of experts then evaluating GRASP reliability. Results: Out of 882 invited experts, 81 valid responses were received. Experts overall strongly agreed to GRASP evaluation criteria of predictive tools (4.35/5). Experts strongly agreed to six criteria; predictive performance (4.87/5), predictive performance levels (4.44/5), usability (4.68/5), potential effect (4.61/5), post-implementation impact (4.78/5) and evidence direction (4.26/5). Experts somewhat agreed to one criterion; post-implementation impact levels (4.16/5). Experts were neutral about one criterion; usability is higher than potential effect (2.97/5). Experts also provided recommendations to six open-ended questions regarding adding, removing or changing evaluation criteria. The GRASP concept and its detailed report were updated then the interrater reliability of GRASP was tested and found to be reliable.
Discussion and Conclusion: The GRASP framework grades predictive tools based on the critical appraisal of the published evidence across three dimensions: 1) Phase of evaluation; 2) Level of evidence; and 3) Direction of evidence. The final grade of a tool is based on the highest phase of evaluation, supported by the highest level of positive evidence, or mixed evidence that supports positive conclusion.

Submitted 24 July, 2019; originally announced July 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1907.03706, arXiv:1907.11523

arXiv:1907.11523 [pdf]

Evaluating the Impact of Using GRASP Framework on Clinicians and Healthcare Professionals Decisions in Selecting Clinical Predictive Tools

Authors: Mohamed Khalifa, Farah Magrabi, Blanca Gallego

Abstract: Background. When selecting predictive tools, clinicians and healthcare professionals are challenged with an overwhelming number of tools, most of which have never been evaluated for comparative effectiveness. To overcome this challenge, the authors developed and validated an evidence-based framework for grading and assessment of predictive tools (GRASP), based on the critical appraisal of published evidence. Methods. To examine GRASP impact on professionals decisions, a controlled experiment was conducted through an online survey. Randomising two groups of tools and two scenarios; participants were asked to select the best tools; most validated or implemented, with and without GRASP. A wide group of international participants were invited. Task completion time, rate of correct decisions, rate of objective vs subjective decisions, and level of decisional conflict were measured. Results. Valid responses received were 194. Compared to not using the framework, GRASP significantly increased correct decisions by 64% (T=8.53, p<0.001), increased objective decision making by 32% (T=9.24, p<0.001), and decreased subjective decision making; based on guessing and based on prior knowledge or experience by 20% (T=-5.47, p<0.001) and 8% (T=-2.99, p=0.003) respectively. GRASP significantly decreased decisional conflict; increasing confidence and satisfaction of participants with their decisions by 11% (T=4.27, p<0.001) and 13% (T=4.89, p<0.001) respectively. GRASP decreased task completion time by 52% (T=-0.87, p=0.384).
The average system usability scale of GRASP was very good; 72.5%, and 88% of participants found GRASP useful. Discussion and Conclusions. Using GRASP has positively supported and significantly improved evidence-based decision making and increased accuracy and efficiency of selecting predictive tools.

Submitted 24 July, 2019; originally announced July 2019.

Comments: 42 pages, 9 figures, and 13 tables. arXiv admin note: text overlap with arXiv:1907.03706, arXiv:1907.11524

arXiv:1907.03706 [pdf]

Developing an Evidence-Based Framework for Grading and Assessment of Predictive Tools for Clinical Decision Support

Authors: Mohamed Khalifa, Farah Magrabi, Blanca Gallego

Abstract: Background: Clinical predictive tools quantify contributions of relevant patient characteristics to derive likelihood of diseases or predict clinical outcomes. When selecting a predictive tool, for implementation at clinical practice or for recommendation in clinical guidelines, clinicians are challenged with an overwhelming and ever growing number of tools, most of which have never been implemented or assessed for comparative effectiveness. Objective: To develop a comprehensive framework to Grade and Assess Predictive tools (GRASP), and provide clinicians with a standardised, evidence based system to support their search for and selection of effective tools. Methods: A focused review of literature was conducted to extract criteria along which tools should be evaluated. An initial framework was designed and applied to assess and grade five tools: LACE Index, Centor Score, Wells Criteria, Modified Early Warning Score, and Ottawa knee rule. After peer review, by expert clinicians and healthcare researchers, the framework was revised and the grading of the tools was updated. Results: GRASP framework grades predictive tools based on published evidence across three dimensions: 1) Phase of evaluation; 2) Level of evidence; and 3) Direction of evidence. The final grade of a tool is based on the highest phase of evaluation, supported by the highest level of positive evidence, or mixed evidence that supports positive conclusion. Discussion and Conclusion: the GRASP framework builds on well established models and widely accepted concepts to provide standardised assessment and evidence based grading of predictive tools. Unlike other methods, GRASP is based on the critical appraisal of published evidence reporting the predictive tools predictive performance before implementation, potential effect and usability during implementation, and their post implementation impact.

Submitted 18 June, 2019; originally announced July 2019.

Comments: 63 pages; 48 pages main text and 15 pages appendix. 6 figures and 12 tables

arXiv:1905.08886 [pdf, other]

doi 10.1109/ISMR.2019.8710182

Automated Pupillary Light Reflex Test on a Portable Platform

Authors: Dogancan Temel, Melvin J. Mathew, Ghassan AlRegib, Yousuf M. Khalifa

Abstract: In this paper, we introduce a portable eye imaging device denoted as lab-on-a-headset, which can automatically perform a swinging flashlight test. We utilized this device in a clinical study to obtain high-resolution recordings of eyes while they are exposed to a varying light stimuli. Half of the participants had relative afferent pupillary defect (RAPD) while the other half was a control group. In case of positive RAPD, patients pupils constrict less or do not constrict when light stimuli swings from the unaffected eye to the affected eye. To automatically diagnose RAPD, we propose an algorithm based on pupil localization, pupil size measurement, and pupil size comparison of right and left eye during the light reflex test. We validate the algorithmic performance over a dataset obtained from 22 subjects and show that proposed algorithm can achieve a sensitivity of 93.8% and a specificity of 87.5%.

Submitted 21 May, 2019; originally announced May 2019.

Comments: 7 pages, 11 figures, 3 tables

ACM Class: I.4

Journal ref: International Symposium on Medical Robotics (ISMR), Atlanta, GA, USA, 2019, pp. 1-7

arXiv:1709.02245 [pdf]

Deep Galaxy: Classification of Galaxies based on Deep Convolutional Neural Networks

Authors: Nour Eldeen M. Khalifa, Mohamed Hamed N. Taha, Aboul Ella Hassanien, I. M. Selim

Abstract: In this paper, a deep convolutional neural network architecture for galaxies classification is presented. The galaxy can be classified based on its features into main three categories Elliptical, Spiral, and Irregular. The proposed deep galaxies architecture consists of 8 layers, one main convolutional layer for features extraction with 96 filters, followed by two principles fully connected layers for classification. It is trained over 1356 images and achieved 97.272% in testing accuracy. A comparative result is made and the testing accuracy was compared with other related works. The proposed architecture outperformed other related works in terms of testing accuracy.

Submitted 2 September, 2017; originally announced September 2017.

Comments: 4 pages, 6 figures, 2 tables, Conference

Showing 1–36 of 36 results for author: Khalifa, M